Add notes for 2022-03-04

2022-03-04 15:30:06 +03:00
parent 7453499827
commit 27acbac859
115 changed files with 6550 additions and 6444 deletions

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -109,11 +109,11 @@
<ul>
<li>Send Gaia the last batch of potential duplicates for items 701 to 980:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4.csv
</span></span><span style="display:flex;"><span>$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p <span style="color:#e6db74">&#39;fuuu&#39;</span> -o /tmp/2022-03-01-tac-batch4-701-980.csv
</span></span><span style="display:flex;"><span>$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &gt; /tmp/tac4-filenames.csv
</span></span><span style="display:flex;"><span>$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2022-03/'>Read more →</a>
</article>
@ -185,13 +185,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
<li>Atmire merged some changes I had submitted to the COUNTER-Robots project</li>
<li>I updated our local spider user agents and then re-ran the list with my <code>check-spider-hits.sh</code> script on CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
<span style="color:#960050;background-color:#1e0010">
</span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
</span></span><span style="display:flex;"><span>Purging 1989 hits from The Knowledge AI in statistics
</span></span><span style="display:flex;"><span>Purging 1235 hits from MaCoCu in statistics
</span></span><span style="display:flex;"><span>Purging 455 hits from WhatsApp in statistics
</span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010">
</span></span></span><span style="display:flex;"><span><span style="color:#960050;background-color:#1e0010"></span>Total number of bot hits purged: 3679
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-12/'>Read more →</a>
</article>
@ -214,9 +214,9 @@ Purging 455 hits from WhatsApp in statistics
<li>I experimented with manually sharding the Solr statistics on DSpace Test</li>
<li>First I exported all the 2019 stats from CGSpace:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ ./run.sh -s http://localhost:8081/solr/statistics -f <span style="color:#e6db74">&#39;time:2019-*&#39;</span> -a export -o statistics-2019.json -k uid
</span></span><span style="display:flex;"><span>$ zstd statistics-2019.json
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-11/'>Read more →</a>
</article>
@ -238,15 +238,15 @@ $ zstd statistics-2019.json
<ul>
<li>Export all affiliations on CGSpace and run them against the latest RoR data dump:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT text_value as &#34;cg.contributor.affiliation&#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>$ csvcut -c <span style="color:#ae81ff">1</span> /tmp/2021-10-01-affiliations.csv | sed 1d &gt; /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
</span></span><span style="display:flex;"><span>ations-matching.csv
</span></span><span style="display:flex;"><span>$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
</span></span><span style="display:flex;"><span>1879
</span></span><span style="display:flex;"><span>$ wc -l /tmp/2021-10-01-affiliations.txt
</span></span><span style="display:flex;"><span>7100 /tmp/2021-10-01-affiliations.txt
</span></span></code></pre></div><ul>
<li>So we have 1879/7100 (26.46%) matching already</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-10/'>Read more →</a>
@ -303,8 +303,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Update Docker images on AReS server (linode20) and reboot the server:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console"># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</code></pre></div><ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span># docker images | grep -v ^REPO | sed <span style="color:#e6db74">&#39;s/ \+/:/g&#39;</span> | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
</span></span></code></pre></div><ul>
<li>I decided to upgrade linode20 from Ubuntu 18.04 to 20.04</li>
</ul>
<a href='https://alanorth.github.io/cgspace-notes/2021-08/'>Read more →</a>
@ -328,9 +328,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
<ul>
<li>Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>localhost/dspace63= &gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
</span></span><span style="display:flex;"><span>COPY 20994
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-07/'>Read more →</a>
</article>

@ -17,11 +17,11 @@
&lt;ul&gt;
&lt;li&gt;Send Gaia the last batch of potential duplicates for items 701 to 980:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4.csv
$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;fuuu&amp;#39;&lt;/span&gt; -o /tmp/2022-03-01-tac-batch4-701-980.csv
$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4-filenames.csv
$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &amp;gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvcut -c id,dc.title,dcterms.issued,dcterms.type ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./ilri/check-duplicates.py -i /tmp/tac4.csv -db dspace -u dspace -p &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;fuuu&amp;#39;&lt;/span&gt; -o /tmp/2022-03-01-tac-batch4-701-980.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvcut -c id,filename ~/Downloads/2022-03-01-CGSpace-TAC-ICW-batch4-701-980.csv &amp;gt; /tmp/tac4-filenames.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &amp;gt; /tmp/2022-03-01-tac-batch4-701-980-filenames.csv
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -66,13 +66,13 @@ $ csvjoin -c id /tmp/2022-03-01-tac-batch4-701-980.csv /tmp/tac4-filenames.csv &
&lt;li&gt;Atmire merged some changes I had submitted to the COUNTER-Robots project&lt;/li&gt;
&lt;li&gt;I updated our local spider user agents and then re-ran the list with my &lt;code&gt;check-spider-hits.sh&lt;/code&gt; script on CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
Purging 1989 hits from The Knowledge AI in statistics
Purging 1235 hits from MaCoCu in statistics
Purging 455 hits from WhatsApp in statistics
&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;&lt;/span&gt;Total number of bot hits purged: 3679
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./ilri/check-spider-hits.sh -f /tmp/agents -p
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Purging 1989 hits from The Knowledge AI in statistics
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Purging 1235 hits from MaCoCu in statistics
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Purging 455 hits from WhatsApp in statistics
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;&lt;/span&gt;Total number of bot hits purged: 3679
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -86,9 +86,9 @@ Purging 455 hits from WhatsApp in statistics
&lt;li&gt;I experimented with manually sharding the Solr statistics on DSpace Test&lt;/li&gt;
&lt;li&gt;First I exported all the 2019 stats from CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;time:2019-*&amp;#39;&lt;/span&gt; -a export -o statistics-2019.json -k uid
$ zstd statistics-2019.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./run.sh -s http://localhost:8081/solr/statistics -f &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;time:2019-*&amp;#39;&lt;/span&gt; -a export -o statistics-2019.json -k uid
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ zstd statistics-2019.json
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -101,15 +101,15 @@ $ zstd statistics-2019.json
&lt;ul&gt;
&lt;li&gt;Export all affiliations on CGSpace and run them against the latest RoR data dump:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;#34;cg.contributor.affiliation&amp;#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
$ csvcut -c &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
ations-matching.csv
$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
1879
$ wc -l /tmp/2021-10-01-affiliations.txt
7100 /tmp/2021-10-01-affiliations.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT text_value as &amp;#34;cg.contributor.affiliation&amp;#34;, count(*) FROM metadatavalue WHERE dspace_object_id IN (SELECT uuid FROM item) AND metadata_field_id = 211 GROUP BY text_value ORDER BY count DESC) to /tmp/2021-10-01-affiliations.csv WITH CSV HEADER;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvcut -c &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; /tmp/2021-10-01-affiliations.csv | sed 1d &amp;gt; /tmp/2021-10-01-affiliations.txt
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ ./ilri/ror-lookup.py -i /tmp/2021-10-01-affiliations.txt -r 2021-09-23-ror-data.json -o /tmp/2021-10-01-affili
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ations-matching.csv
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ csvgrep -c matched -m true /tmp/2021-10-01-affiliations-matching.csv | sed 1d | wc -l
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1879
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ wc -l /tmp/2021-10-01-affiliations.txt
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;7100 /tmp/2021-10-01-affiliations.txt
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;So we have 1879/7100 (26.46%) matching already&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -148,8 +148,8 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Update Docker images on AReS server (linode20) and reboot the server:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;# docker images | grep -v ^REPO | sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;s/ \+/:/g&amp;#39;&lt;/span&gt; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;# docker images | grep -v ^REPO | sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;s/ \+/:/g&amp;#39;&lt;/span&gt; | cut -d: -f1,2 | grep -v none | xargs -L1 docker pull
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ul&gt;
&lt;li&gt;I decided to upgrade linode20 from Ubuntu 18.04 to 20.04&lt;/li&gt;
&lt;/ul&gt;</description>
</item>
@ -164,9 +164,9 @@ $ wc -l /tmp/2021-10-01-affiliations.txt
&lt;ul&gt;
&lt;li&gt;Export another list of ALL subjects on CGSpace, including AGROVOC and non-AGROVOC for Enrico:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
COPY 20994
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;localhost/dspace63= &amp;gt; \COPY (SELECT DISTINCT LOWER(text_value) AS subject, count(*) FROM metadatavalue WHERE dspace_object_id in (SELECT dspace_object_id FROM item) AND metadata_field_id IN (119, 120, 127, 122, 128, 125, 135, 203, 208, 210, 215, 123, 236, 242, 187) GROUP BY subject ORDER BY count DESC) to /tmp/2021-07-01-all-subjects.csv WITH CSV HEADER;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;COPY 20994
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -271,17 +271,17 @@ COPY 20994
&lt;li&gt;I had a call with CodeObia to discuss the work on OpenRXV&lt;/li&gt;
&lt;li&gt;Check the results of the AReS harvesting from last night:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;$ curl -s &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&amp;#39;&lt;/span&gt;
{
&amp;#34;count&amp;#34; : 100875,
&amp;#34;_shards&amp;#34; : {
&amp;#34;total&amp;#34; : 1,
&amp;#34;successful&amp;#34; : 1,
&amp;#34;skipped&amp;#34; : 0,
&amp;#34;failed&amp;#34; : 0
}
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-console&#34; data-lang=&#34;console&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;$ curl -s &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;amp;pretty&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;{
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;count&amp;#34; : 100875,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;_shards&amp;#34; : {
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;total&amp;#34; : 1,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;successful&amp;#34; : 1,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;skipped&amp;#34; : 0,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; &amp;#34;failed&amp;#34; : 0
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;}
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
</item>
<item>
@ -599,17 +599,17 @@ COPY 20994
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34;
1277694
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;So 4.6 million from XMLUI and another 1.2 million from API requests&lt;/li&gt;
&lt;li&gt;Let&amp;rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &amp;quot;[0-9]{1,2}/Oct/2019&amp;quot; | grep -c -E &amp;quot;/rest/bitstreams&amp;quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &amp;#34;[0-9]{1,2}/Oct/2019&amp;#34; | grep -c -E &amp;#34;/rest/bitstreams&amp;#34;
106781
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -620,7 +620,7 @@ COPY 20994
<pubDate>Tue, 01 Oct 2019 13:20:51 +0300</pubDate>
<guid>https://alanorth.github.io/cgspace-notes/2019-10/</guid>
<description>2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&amp;rsquo;s &amp;ldquo;unneccesary Unicode&amp;rdquo; fix: $ csvcut -c &#39;id,dc.</description>
<description>2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace I exported it, but a quick run through the csv-metadata-quality tool shows that there are some low-hanging fruits we can fix before I send him the data I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&amp;rsquo;s &amp;ldquo;unneccesary Unicode&amp;rdquo; fix: $ csvcut -c &amp;#39;id,dc.</description>
</item>
<item>
@ -634,7 +634,7 @@ COPY 20994
&lt;li&gt;Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning&lt;/li&gt;
&lt;li&gt;Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &amp;#34;01/Sep/2019:0&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -645,7 +645,7 @@ COPY 20994
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &amp;quot;01/Sep/2019:0&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &amp;#34;01/Sep/2019:0&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -761,16 +761,16 @@ DELETE 1
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &amp;#39;Spore-192-EN-web.pdf&amp;#39; | grep -E &amp;#39;(18.196.196.108|18.195.78.144|18.195.218.6)&amp;#39; | awk &amp;#39;{print $9}&amp;#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;In the last two weeks there have been 47,000 downloads of this &lt;em&gt;same exact PDF&lt;/em&gt; by these three IP addresses&lt;/li&gt;
&lt;li&gt;Apply country and region corrections and deletions on DSpace Test and CGSpace:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -m 231 -f cg.coverage.region -d
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -808,7 +808,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!&lt;/li&gt;
&lt;li&gt;The top IPs before, during, and after this latest alert tonight were:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;01/Feb/2019:(17|18|19|20|21)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;#34;01/Feb/2019:(17|18|19|20|21)&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -824,7 +824,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
&lt;li&gt;The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase&lt;/li&gt;
&lt;li&gt;There were just over 3 million accesses in the nginx logs last month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;quot;[0-9]{1,2}/Jan/2019&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# time zcat --force /var/log/nginx/* | grep -cE &amp;#34;[0-9]{1,2}/Jan/2019&amp;#34;
3018243
real 0m19.873s
@ -844,7 +844,7 @@ sys 0m1.979s
&lt;li&gt;Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning&lt;/li&gt;
&lt;li&gt;I don&amp;rsquo;t see anything interesting in the web server logs around that time though:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;quot;02/Jan/2019:0(1|2|3)&amp;quot; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &amp;#34;02/Jan/2019:0(1|2|3)&amp;#34; | awk &amp;#39;{print $1}&amp;#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -979,7 +979,7 @@ sys 0m1.979s
&lt;li&gt;I added the new CCAFS Phase II Project Tag &lt;code&gt;PII-FP1_PACCA2&lt;/code&gt; and merged it into the &lt;code&gt;5_x-prod&lt;/code&gt; branch (&lt;a href=&#34;https://github.com/ilri/DSpace/pull/379&#34;&gt;#379&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;I proofed and tested the ILRI author corrections that Peter sent back to me this week:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &amp;#39;fuuu&amp;#39; -f dc.contributor.author -t correct -m 3 -n
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in &lt;a href=&#34;https://alanorth.github.io/cgspace-notes/cgspace-notes/2018-03/&#34;&gt;March, 2018&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Time to index ~70,000 items on CGSpace:&lt;/li&gt;
@ -1073,11 +1073,11 @@ sys 2m7.289s
&lt;li&gt;I notice this error quite a few times in dspace.log:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &amp;quot; &amp;quot;]&amp;quot; &amp;quot;] &amp;quot;&amp;quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &amp;#39;dateIssued_keyword:[1976+TO+1979]&amp;#39;: Encountered &amp;#34; &amp;#34;]&amp;#34; &amp;#34;] &amp;#34;&amp;#34; at line 1, column 32.
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;And there are many of these errors every day for the past month:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ grep -c &amp;quot;Error while searching for sidebar facets&amp;quot; dspace.log.*
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ grep -c &amp;#34;Error while searching for sidebar facets&amp;#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -1155,12 +1155,12 @@ dspace.log.2018-01-02:34
&lt;ul&gt;
&lt;li&gt;Today there have been no hits by CORE and no alerts from Linode (coincidence?)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# grep -c &amp;quot;CORE&amp;quot; /var/log/nginx/access.log
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# grep -c &amp;#34;CORE&amp;#34; /var/log/nginx/access.log
0
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;Generate list of authors on CGSpace for Peter to go through and correct:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &amp;#39;contributor&amp;#39; and qualifier = &amp;#39;author&amp;#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1289,7 +1289,7 @@ COPY 54701
&lt;li&gt;Remove redundant/duplicate text in the DSpace submission license&lt;/li&gt;
&lt;li&gt;Testing the CMYK patch on a collection with 650 items:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &amp;quot;ImageMagick PDF Thumbnail&amp;quot; -v &amp;gt;&amp;amp; /tmp/filter-media-cmyk.txt
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &amp;#34;ImageMagick PDF Thumbnail&amp;#34; -v &amp;gt;&amp;amp; /tmp/filter-media-cmyk.txt
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1330,7 +1330,7 @@ COPY 54701
&lt;ul&gt;
&lt;li&gt;An item was mapped twice erroneously again, so I had to remove one of the mappings manually:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# select * from collection2item where item_id = &#39;80278&#39;;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspace=# select * from collection2item where item_id = &amp;#39;80278&amp;#39;;
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
@ -1370,11 +1370,11 @@ DELETE 1
&lt;li&gt;CGSpace was down for five hours in the morning while I was sleeping&lt;/li&gt;
&lt;li&gt;While looking in the logs for errors, I see tons of warnings about Atmire MQM:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&amp;quot;dc.title&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&amp;quot;THUMBNAIL&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&amp;quot;-1&amp;quot;, transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&amp;quot;TX157907838689377964651674089851855413607&amp;quot;)
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&amp;#34;TX157907838689377964651674089851855413607&amp;#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&amp;#34;dc.title&amp;#34;, transactionID=&amp;#34;TX157907838689377964651674089851855413607&amp;#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&amp;#34;THUMBNAIL&amp;#34;, transactionID=&amp;#34;TX157907838689377964651674089851855413607&amp;#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&amp;#34;-1&amp;#34;, transactionID=&amp;#34;TX157907838689377964651674089851855413607&amp;#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&amp;#34;TX157907838689377964651674089851855413607&amp;#34;)
&lt;/code&gt;&lt;/pre&gt;&lt;ul&gt;
&lt;li&gt;I see thousands of them in the logs for the last few months, so it&amp;rsquo;s not related to the DSpace 5.5 upgrade&lt;/li&gt;
&lt;li&gt;I&amp;rsquo;ve raised a ticket with Atmire to ask&lt;/li&gt;
@ -1429,7 +1429,7 @@ DELETE 1
&lt;li&gt;We had been using &lt;code&gt;DC=ILRI&lt;/code&gt; to determine whether a user was ILRI or not&lt;/li&gt;
&lt;li&gt;It looks like we might be able to use OUs now, instead of DCs:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &amp;quot;dc=cgiarad,dc=org&amp;quot; -D &amp;quot;admigration1@cgiarad.org&amp;quot; -W &amp;quot;(sAMAccountName=admigration1)&amp;quot;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &amp;#34;dc=cgiarad,dc=org&amp;#34; -D &amp;#34;admigration1@cgiarad.org&amp;#34; -W &amp;#34;(sAMAccountName=admigration1)&amp;#34;
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
@ -1465,9 +1465,9 @@ $ git rebase -i dspace-5.5
&lt;li&gt;Add &lt;code&gt;dc.description.sponsorship&lt;/code&gt; to Discovery sidebar facets and make investors clickable in item view (&lt;a href=&#34;https://github.com/ilri/DSpace/issues/232&#34;&gt;#232&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;I think this query should find and replace all authors that have &amp;ldquo;,&amp;rdquo; at the end of their names:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &amp;#39;(^.+?),$&amp;#39;, &amp;#39;\1&amp;#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &amp;#39;^.+?,$&amp;#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &amp;#39;^.+?,$&amp;#39;;
text_value
------------
(0 rows)
@ -1505,7 +1505,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
&lt;li&gt;I have blocked access to the API now&lt;/li&gt;
&lt;li&gt;There are 3,000 IPs accessing the REST API in a 24-hour period!&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# awk &amp;#39;{print $1}&amp;#39; /var/log/nginx/rest.log | uniq | wc -l
3168
&lt;/code&gt;&lt;/pre&gt;</description>
</item>
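A caveat on the REST API count above: `uniq` only collapses adjacent duplicate lines, so without a preceding `sort` the 3168 figure counts runs of consecutive requests from the same address, not distinct addresses. The true number of distinct clients can be no higher than that. A minimal Python sketch of the difference, assuming the client address is the first whitespace-separated field of each log line (as the `awk` call implies); the function names and sample lines are hypothetical:

```python
def count_distinct_clients(lines):
    """Count distinct first fields (client addresses) across log lines."""
    clients = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            clients.add(fields[0])
    return len(clients)


def count_adjacent_runs(lines):
    """Roughly mimic `awk '{print $1}' | uniq | wc -l`: count runs of equal adjacent values."""
    runs = 0
    previous = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] != previous:
            runs += 1
            previous = fields[0]
    return runs


# Hypothetical sample: two clients whose requests interleave
sample = [
    "192.0.2.1 - - GET /rest/items",
    "192.0.2.2 - - GET /rest/items",
    "192.0.2.1 - - GET /rest/collections",
]
print(count_distinct_clients(sample))
print(count_adjacent_runs(sample))
```

On the shell side, the equivalent of the first function is to put `sort` before `uniq` (or use `sort -u`).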
@ -1603,7 +1603,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
&lt;li&gt;Looks like DSpace exhausted its PostgreSQL connection pool&lt;/li&gt;
&lt;li&gt;Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;$ psql -c &amp;#39;SELECT * from pg_stat_activity;&amp;#39; | grep idle | grep -c cgspace
78
&lt;/code&gt;&lt;/pre&gt;</description>
</item>


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -221,17 +221,17 @@
<li>I had a call with CodeObia to discuss the work on OpenRXV</li>
<li>Check the results of the AReS harvesting from last night:</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-console" data-lang="console">$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
{
&#34;count&#34; : 100875,
&#34;_shards&#34; : {
&#34;total&#34; : 1,
&#34;successful&#34; : 1,
&#34;skipped&#34; : 0,
&#34;failed&#34; : 0
}
}
</code></pre></div>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-console" data-lang="console"><span style="display:flex;"><span>$ curl -s <span style="color:#e6db74">&#39;http://localhost:9200/openrxv-items-temp/_count?q=*&amp;pretty&#39;</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span> &#34;count&#34; : 100875,
</span></span><span style="display:flex;"><span> &#34;_shards&#34; : {
</span></span><span style="display:flex;"><span> &#34;total&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;successful&#34; : 1,
</span></span><span style="display:flex;"><span> &#34;skipped&#34; : 0,
</span></span><span style="display:flex;"><span> &#34;failed&#34; : 0
</span></span><span style="display:flex;"><span> }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>
<a href='https://alanorth.github.io/cgspace-notes/2021-02/'>Read more →</a>
</article>
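The count check above could also be scripted. A minimal sketch, assuming the response has the same shape as the JSON shown (a top-level `count` and a `_shards` object); the function name is hypothetical and the sample string stands in for a live response:

```python
import json


def summarize_count_response(body):
    """Return (count, all_shards_ok) from an Elasticsearch _count response body."""
    data = json.loads(body)
    shards = data["_shards"]
    # Treat the count as trustworthy only if every shard answered without failure
    all_ok = shards["failed"] == 0 and shards["successful"] == shards["total"]
    return data["count"], all_ok


# Sample copied from the output above
sample = '{"count": 100875, "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0}}'
count, ok = summarize_count_response(sample)
print(count, ok)
```

Checking the shard fields alongside the count guards against reading a partial result as the full harvest size.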


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -113,17 +113,17 @@
</ul>
</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/*access.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
4671942
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &quot;[0-9]{1,2}/Oct/2019&quot;
# zcat --force /var/log/nginx/{rest,oai,statistics}.log.*.gz | grep -cE &#34;[0-9]{1,2}/Oct/2019&#34;
1277694
</code></pre><ul>
<li>So 4.6 million from XMLUI and another 1.2 million from API requests</li>
<li>Let&rsquo;s see how many of the REST API requests were for bitstreams (because they are counted in Solr stats):</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &quot;[0-9]{1,2}/Oct/2019&quot;
<pre tabindex="0"><code># zcat --force /var/log/nginx/rest.log.*.gz | grep -c -E &#34;[0-9]{1,2}/Oct/2019&#34;
1183456
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &quot;[0-9]{1,2}/Oct/2019&quot; | grep -c -E &quot;/rest/bitstreams&quot;
# zcat --force /var/log/nginx/rest.log.*.gz | grep -E &#34;[0-9]{1,2}/Oct/2019&#34; | grep -c -E &#34;/rest/bitstreams&#34;
106781
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2019-11/'>Read more →</a>
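To put the two REST counts above in proportion: 106,781 of 1,183,456 requests is about 9%. A small sketch of the same filter-and-count logic, assuming each log line carries a date such as `01/Oct/2019` and the request path as plain text; the function name and sample lines are hypothetical:

```python
import re

# Same pattern as the grep -E calls above
OCT_2019 = re.compile(r"[0-9]{1,2}/Oct/2019")


def bitstream_share(lines):
    """Return (total, bitstreams, share) for lines dated October 2019."""
    dated = [line for line in lines if OCT_2019.search(line)]
    bitstreams = sum(1 for line in dated if "/rest/bitstreams" in line)
    share = bitstreams / len(dated) if dated else 0.0
    return len(dated), bitstreams, share


# Hypothetical sample: two October requests (one bitstream) and one from November
sample = [
    '192.0.2.1 - - [01/Oct/2019:00:00:01 +0000] "GET /rest/bitstreams/1/retrieve HTTP/1.1" 200',
    '192.0.2.1 - - [02/Oct/2019:00:00:02 +0000] "GET /rest/items/2 HTTP/1.1" 200',
    '192.0.2.2 - - [01/Nov/2019:00:00:03 +0000] "GET /rest/bitstreams/3/retrieve HTTP/1.1" 200',
]
print(bitstream_share(sample))

# Share implied by the counts reported above
print(round(106781 / 1183456, 4))
```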
@ -143,7 +143,7 @@
</p>
</header>
2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace. I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data. I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unnecessary Unicode&rdquo; fix: $ csvcut -c 'id,dc.
2019-10-01 Udana from IWMI asked me for a CSV export of their community on CGSpace. I exported it, but a quick run through the csv-metadata-quality tool shows that there is some low-hanging fruit we can fix before I send him the data. I will limit the scope to the titles, regions, subregions, and river basins for now to manually fix some non-breaking spaces (U+00A0) there that would otherwise be removed by the csv-metadata-quality script&rsquo;s &ldquo;unnecessary Unicode&rdquo; fix: $ csvcut -c &#39;id,dc.
<a href='https://alanorth.github.io/cgspace-notes/2019-10/'>Read more →</a>
</article>
@ -166,7 +166,7 @@
<li>Linode emailed to say that CGSpace (linode18) had a high rate of outbound traffic for several hours this morning</li>
<li>Here are the top ten IPs in the nginx XMLUI and REST/OAI logs this morning:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
440 17.58.101.255
441 157.55.39.101
485 207.46.13.43
@ -177,7 +177,7 @@
814 207.46.13.212
2472 163.172.71.23
6092 3.94.211.189
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &quot;01/Sep/2019:0&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
# zcat --force /var/log/nginx/rest.log /var/log/nginx/rest.log.1 /var/log/nginx/oai.log /var/log/nginx/oai.log.1 | grep -E &#34;01/Sep/2019:0&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
33 2a01:7e00::f03c:91ff:fe16:fcb
57 3.83.192.124
57 3.87.77.25
@ -338,16 +338,16 @@ DELETE 1
</ul>
</li>
</ul>
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep 'Spore-192-EN-web.pdf' | grep -E '(18.196.196.108|18.195.78.144|18.195.218.6)' | awk '{print $9}' | sort | uniq -c | sort -n | tail -n 5
<pre tabindex="0"><code># cat /var/log/nginx/access.log /var/log/nginx/access.log.1 | grep &#39;Spore-192-EN-web.pdf&#39; | grep -E &#39;(18.196.196.108|18.195.78.144|18.195.218.6)&#39; | awk &#39;{print $9}&#39; | sort | uniq -c | sort -n | tail -n 5
4432 200
</code></pre><ul>
<li>In the last two weeks there have been 47,000 downloads of this <em>same exact PDF</em> by these three IP addresses</li>
<li>Apply country and region corrections and deletions on DSpace Test and CGSpace:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p 'fuuu' -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p 'fuuu' -m 231 -f cg.coverage.region -d
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-9-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.country -m 228 -t ACTION -d
$ ./fix-metadata-values.py -i /tmp/2019-02-21-fix-4-regions.csv -db dspace -u dspace -p &#39;fuuu&#39; -f cg.coverage.region -m 231 -t action -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-2-countries.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 228 -f cg.coverage.country -d
$ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace -u dspace -p &#39;fuuu&#39; -m 231 -f cg.coverage.region -d
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2019-04/'>Read more →</a>
</article>
@ -403,7 +403,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>Linode has alerted a few times since last night that the CPU usage on CGSpace (linode18) was high despite me increasing the alert threshold last week from 250% to 275%—I might need to increase it again!</li>
<li>The top IPs before, during, and after this latest alert tonight were:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;01/Feb/2019:(17|18|19|20|21)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;01/Feb/2019:(17|18|19|20|21)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
245 207.46.13.5
332 54.70.40.11
385 5.143.231.38
@ -419,7 +419,7 @@ $ ./delete-metadata-values.py -i /tmp/2019-02-21-delete-1-region.csv -db dspace
<li>The Solr statistics the past few months have been very high and I was wondering if the web server logs also showed an increase</li>
<li>There were just over 3 million accesses in the nginx logs last month:</li>
</ul>
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &quot;[0-9]{1,2}/Jan/2019&quot;
<pre tabindex="0"><code># time zcat --force /var/log/nginx/* | grep -cE &#34;[0-9]{1,2}/Jan/2019&#34;
3018243
real 0m19.873s


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -110,7 +110,7 @@
<li>Linode alerted that CGSpace (linode18) had a higher outbound traffic rate than normal early this morning</li>
<li>I don&rsquo;t see anything interesting in the web server logs around that time though:</li>
</ul>
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &quot;02/Jan/2019:0(1|2|3)&quot; | awk '{print $1}' | sort | uniq -c | sort -n | tail -n 10
<pre tabindex="0"><code># zcat --force /var/log/nginx/*.log /var/log/nginx/*.log.1 | grep -E &#34;02/Jan/2019:0(1|2|3)&#34; | awk &#39;{print $1}&#39; | sort | uniq -c | sort -n | tail -n 10
92 40.77.167.4
99 210.7.29.100
120 38.126.157.45
@ -308,7 +308,7 @@
<li>I added the new CCAFS Phase II Project Tag <code>PII-FP1_PACCA2</code> and merged it into the <code>5_x-prod</code> branch (<a href="https://github.com/ilri/DSpace/pull/379">#379</a>)</li>
<li>I proofed and tested the ILRI author corrections that Peter sent back to me this week:</li>
</ul>
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3 -n
<pre tabindex="0"><code>$ ./fix-metadata-values.py -i /tmp/2018-05-30-Correct-660-authors.csv -db dspace -u dspace -p &#39;fuuu&#39; -f dc.contributor.author -t correct -m 3 -n
</code></pre><ul>
<li>I think a sane proofing workflow in OpenRefine is to apply the custom text facets for check/delete/remove and illegal characters that I developed in <a href="/cgspace-notes/2018-03/">March, 2018</a></li>
<li>Time to index ~70,000 items on CGSpace:</li>


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -166,11 +166,11 @@
<li>I notice this error quite a few times in dspace.log:</li>
</ul>
<pre tabindex="0"><code>2018-01-02 01:21:19,137 ERROR org.dspace.app.xmlui.aspect.discovery.SidebarFacetsTransformer @ Error while searching for sidebar facets
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse 'dateIssued_keyword:[1976+TO+1979]': Encountered &quot; &quot;]&quot; &quot;] &quot;&quot; at line 1, column 32.
org.dspace.discovery.SearchServiceException: org.apache.solr.search.SyntaxError: Cannot parse &#39;dateIssued_keyword:[1976+TO+1979]&#39;: Encountered &#34; &#34;]&#34; &#34;] &#34;&#34; at line 1, column 32.
</code></pre><ul>
<li>And there are many of these errors every day for the past month:</li>
</ul>
<pre tabindex="0"><code>$ grep -c &quot;Error while searching for sidebar facets&quot; dspace.log.*
<pre tabindex="0"><code>$ grep -c &#34;Error while searching for sidebar facets&#34; dspace.log.*
dspace.log.2017-11-21:4
dspace.log.2017-11-22:1
dspace.log.2017-11-23:4
@ -266,12 +266,12 @@ dspace.log.2018-01-02:34
<ul>
<li>Today there have been no hits by CORE and no alerts from Linode (coincidence?)</li>
</ul>
<pre tabindex="0"><code># grep -c &quot;CORE&quot; /var/log/nginx/access.log
<pre tabindex="0"><code># grep -c &#34;CORE&#34; /var/log/nginx/access.log
0
</code></pre><ul>
<li>Generate list of authors on CGSpace for Peter to go through and correct:</li>
</ul>
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = 'contributor' and qualifier = 'author') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
<pre tabindex="0"><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where element = &#39;contributor&#39; and qualifier = &#39;author&#39;) AND resource_type_id = 2 group by text_value order by count desc) to /tmp/authors.csv with csv;
COPY 54701
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2017-11/'>Read more →</a>


@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -151,7 +151,7 @@
<li>Remove redundant/duplicate text in the DSpace submission license</li>
<li>Testing the CMYK patch on a collection with 650 items:</li>
</ul>
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &quot;ImageMagick PDF Thumbnail&quot; -v &gt;&amp; /tmp/filter-media-cmyk.txt
<pre tabindex="0"><code>$ [dspace]/bin/dspace filter-media -f -i 10568/16498 -p &#34;ImageMagick PDF Thumbnail&#34; -v &gt;&amp; /tmp/filter-media-cmyk.txt
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2017-04/'>Read more →</a>
</article>
@ -210,7 +210,7 @@
<ul>
<li>An item was mapped twice erroneously again, so I had to remove one of the mappings manually:</li>
</ul>
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = '80278';
<pre tabindex="0"><code>dspace=# select * from collection2item where item_id = &#39;80278&#39;;
id | collection_id | item_id
-------+---------------+---------
92551 | 313 | 80278
@ -268,11 +268,11 @@ DELETE 1
<li>CGSpace was down for five hours in the morning while I was sleeping</li>
<li>While looking in the logs for errors, I see tons of warnings about Atmire MQM:</li>
</ul>
<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&quot;dc.title&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&quot;THUMBNAIL&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&quot;-1&quot;, transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&quot;TX157907838689377964651674089851855413607&quot;)
<pre tabindex="0"><code>2016-12-02 03:00:32,352 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=CREATE, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632305, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY_METADATA, SubjectType=BUNDLE, SubjectID=70316, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632309, dispatcher=1544803905, detail=&#34;dc.title&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=ITEM, SubjectID=80044, ObjectType=BUNDLE, ObjectID=70316, TimeStamp=1480647632311, dispatcher=1544803905, detail=&#34;THUMBNAIL&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=ADD, SubjectType=BUNDLE, SubjectID=70316, ObjectType=BITSTREAM, ObjectID=86715, TimeStamp=1480647632318, dispatcher=1544803905, detail=&#34;-1&#34;, transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
2016-12-02 03:00:32,353 WARN com.atmire.metadataquality.batchedit.BatchEditConsumer @ BatchEditConsumer should not have been given this kind of Subject in an event, skipping: org.dspace.event.Event(eventType=MODIFY, SubjectType=ITEM, SubjectID=80044, ObjectType=(Unknown), ObjectID=-1, TimeStamp=1480647632351, dispatcher=1544803905, detail=[null], transactionID=&#34;TX157907838689377964651674089851855413607&#34;)
</code></pre><ul>
<li>I see thousands of them in the logs for the last few months, so it&rsquo;s not related to the DSpace 5.5 upgrade</li>
<li>I&rsquo;ve raised a ticket with Atmire to ask</li>
@ -354,7 +354,7 @@ DELETE 1
<li>We had been using <code>DC=ILRI</code> to determine whether a user was ILRI or not</li>
<li>It looks like we might be able to use OUs now, instead of DCs:</li>
</ul>
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &quot;dc=cgiarad,dc=org&quot; -D &quot;admigration1@cgiarad.org&quot; -W &quot;(sAMAccountName=admigration1)&quot;
<pre tabindex="0"><code>$ ldapsearch -x -H ldaps://svcgroot2.cgiarad.org:3269/ -b &#34;dc=cgiarad,dc=org&#34; -D &#34;admigration1@cgiarad.org&#34; -W &#34;(sAMAccountName=admigration1)&#34;
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2016-09/'>Read more →</a>
</article>

@ -10,14 +10,14 @@
<meta property="og:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository." />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/posts/" />
<meta property="og:updated_time" content="2022-03-01T17:17:27+03:00" />
<meta property="og:updated_time" content="2022-03-01T17:48:40+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Posts"/>
<meta name="twitter:description" content="Documenting day-to-day work on the [CGSpace](https://cgspace.cgiar.org) repository."/>
<meta name="generator" content="Hugo 0.92.2" />
<meta name="generator" content="Hugo 0.93.1" />
@ -140,9 +140,9 @@ $ git rebase -i dspace-5.5
<li>Add <code>dc.description.sponsorship</code> to Discovery sidebar facets and make investors clickable in item view (<a href="https://github.com/ilri/DSpace/issues/232">#232</a>)</li>
<li>I think this query should find and replace all authors that have &ldquo;,&rdquo; at the end of their names:</li>
</ul>
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, '(^.+?),$', '\1') where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
<pre tabindex="0"><code>dspacetest=# update metadatavalue set text_value = regexp_replace(text_value, &#39;(^.+?),$&#39;, &#39;\1&#39;) where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
UPDATE 95
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ '^.+?,$';
dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and resource_type_id=2 and text_value ~ &#39;^.+?,$&#39;;
text_value
------------
(0 rows)
@ -198,7 +198,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<li>I have blocked access to the API now</li>
<li>There are 3,000 IPs accessing the REST API in a 24-hour period!</li>
</ul>
<pre tabindex="0"><code># awk '{print $1}' /var/log/nginx/rest.log | uniq | wc -l
<pre tabindex="0"><code># awk &#39;{print $1}&#39; /var/log/nginx/rest.log | uniq | wc -l
3168
</code></pre>
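Note that `uniq` only collapses adjacent duplicate lines, so on a log that is not sorted by IP the pipeline above may count the same address more than once (sorting first, for example with `sort -u`, avoids that). A minimal Python sketch of the difference, assuming the client IP is the first whitespace-separated field of each line; the sample lines and the function names are hypothetical:

```python
from itertools import groupby


def first_fields(lines):
    """Extract the first whitespace-separated field from each non-empty line."""
    return [line.split()[0] for line in lines if line.split()]


def count_unique_ips(lines):
    """Count distinct IPs regardless of order (like `sort -u | wc -l`)."""
    return len(set(first_fields(lines)))


def count_adjacent_runs(lines):
    """Count runs of identical adjacent IPs (like `uniq | wc -l` on unsorted input)."""
    return sum(1 for _ in groupby(first_fields(lines)))


# Documentation-range addresses used purely as sample data.
sample = [
    '192.0.2.1 - - [01/May/2016:00:00:01 +0000] "GET /rest/items HTTP/1.1" 200 512',
    '198.51.100.7 - - [01/May/2016:00:00:02 +0000] "GET /rest/items HTTP/1.1" 200 512',
    '192.0.2.1 - - [01/May/2016:00:00:03 +0000] "GET /rest/items HTTP/1.1" 200 512',
]

print(count_unique_ips(sample), count_adjacent_runs(sample))
```

If the two counts differ on a real log, the figure reported by the unsorted pipeline is an upper bound rather than the exact number of distinct clients.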
<a href='https://alanorth.github.io/cgspace-notes/2016-05/'>Read more →</a>
@ -350,7 +350,7 @@ dspacetest=# select text_value from metadatavalue where metadata_field_id=3 and
<li>Looks like DSpace exhausted its PostgreSQL connection pool</li>
<li>Last week I had increased the limit from 30 to 60, which seemed to help, but now there are many more idle connections:</li>
</ul>
<pre tabindex="0"><code>$ psql -c 'SELECT * from pg_stat_activity;' | grep idle | grep -c cgspace
<pre tabindex="0"><code>$ psql -c &#39;SELECT * from pg_stat_activity;&#39; | grep idle | grep -c cgspace
78
</code></pre>
<a href='https://alanorth.github.io/cgspace-notes/2015-11/'>Read more →</a>