Add notes for 2020-10-13 and 2020-10-14

2020-10-14 22:21:03 +03:00
parent 076cb51cc6
commit ae2c5bd8f6
89 changed files with 263 additions and 115 deletions


@ -23,7 +23,7 @@ During the FlywayDB migration I got an error:
<meta property="og:type" content="article" />
<meta property="og:url" content="https://alanorth.github.io/cgspace-notes/2020-10/" />
<meta property="article:published_time" content="2020-10-06T16:55:54+03:00" />
<meta property="article:modified_time" content="2020-10-08T15:54:02+03:00" />
<meta property="article:modified_time" content="2020-10-12T17:53:24+03:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="October, 2020"/>
@ -41,7 +41,7 @@ During the FlywayDB migration I got an error:
"/>
<meta name="generator" content="Hugo 0.76.3" />
<meta name="generator" content="Hugo 0.76.4" />
@ -51,9 +51,9 @@ During the FlywayDB migration I got an error:
"@type": "BlogPosting",
"headline": "October, 2020",
"url": "https://alanorth.github.io/cgspace-notes/2020-10/",
"wordCount": "1895",
"wordCount": "2381",
"datePublished": "2020-10-06T16:55:54+03:00",
"dateModified": "2020-10-08T15:54:02+03:00",
"dateModified": "2020-10-12T17:53:24+03:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -459,7 +459,87 @@ Purging 1062 hits from Vizzit in statistics-2018
Purging 920 hits from Scoop\.it in statistics-2018
Total number of bot hits purged: 3684
</code></pre><!-- raw HTML omitted -->
</code></pre><h2 id="2020-10-13">2020-10-13</h2>
<ul>
<li>Skype with Peter about AReS again
<ul>
<li>We decided to use Title Case for our countries on CGSpace to minimize the need for mapping on AReS</li>
<li>We did some work to add a dozen more mappings for strange and incorrect CRPs on AReS</li>
</ul>
</li>
<li>I can update the country metadata in PostgreSQL like this:</li>
</ul>
<pre><code>dspace=&gt; BEGIN;
dspace=&gt; UPDATE metadatavalue SET text_value=INITCAP(text_value) WHERE resource_type_id=2 AND metadata_field_id=228;
UPDATE 51756
dspace=&gt; COMMIT;
</code></pre><ul>
<li>I will need to pay special attention to Côte d&rsquo;Ivoire, Bosnia and Herzegovina, and a few others though&hellip; it may be better to do a search and replace using <code>fix-metadata-values.py</code>
<ul>
<li>Export a list of distinct values from the database:</li>
</ul>
</li>
</ul>
<pre><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.country&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=228) TO /tmp/2020-10-13-countries.csv WITH CSV HEADER;
COPY 195
</code></pre><ul>
<li>Then I loaded the list into OpenRefine, made a new column for corrections, and used this GREL expression to convert the values to title case: <code>value.toTitlecase()</code>
<ul>
<li>I still had to double-check everything to catch some corner cases (Andorra, Timor-leste, etc.)</li>
</ul>
</li>
<li>For the input forms I found out how to do a complicated search and replace in vim:</li>
</ul>
<pre><code>:'&lt;,'&gt;s/\&lt;\(pair\|displayed\|stored\|value\|AND\)\@!\(\w\)\(\w*\|\)\&gt;/\u\2\L\3/g
</code></pre><ul>
<li>It uses a <a href="https://jbodah.github.io/blog/2016/11/01/positivenegative-lookaheadlookbehind-vim/">negative lookahead</a> (one of the &ldquo;lookaround&rdquo; assertions, in PCRE terms) to match only words that do <em>not</em> start with &ldquo;pair&rdquo;, &ldquo;displayed&rdquo;, etc., because we don&rsquo;t want to edit the XML tags themselves&hellip;
<ul>
<li>I had to fix a few manually after doing this, as above with PostgreSQL</li>
</ul>
</li>
</ul>
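<p>As a rough cross-check of the vim substitution above, here is a minimal Python sketch of the same idea using the standard <code>re</code> module. The function name <code>title_case_skipping</code> and the sample strings are assumptions for illustration, and the match is case sensitive, which may differ from vim depending on its <code>ignorecase</code> setting.</p>

```python
import re

# Words to leave untouched, copied from the vim pattern. Like the vim
# lookahead, this also skips any word that merely starts with one of them.
SKIP = ("pair", "displayed", "stored", "value", "AND")

# \b marks the start of a word, (?!...) is the negative lookahead, and the
# two groups capture the first letter and the rest of the word.
WORD = re.compile(r"\b(?!" + "|".join(SKIP) + r")(\w)(\w*)\b")


def title_case_skipping(text):
    """Uppercase the first letter of each matched word, lowercase the rest."""
    return WORD.sub(lambda m: m.group(1).upper() + m.group(2).lower(), text)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the real input forms.
    print(title_case_skipping("value BOSNIA AND HERZEGOVINA"))
    print(title_case_skipping("CÔTE D'IVOIRE"))
```

<p>The second call returns <code>Côte D'Ivoire</code> with a capital D, which is the kind of corner case that still needs a manual fix afterwards.</p>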
<h2 id="2020-10-14">2020-10-14</h2>
<ul>
<li>I discussed the title casing of countries with Abenet and she suggested we also apply title casing to regions
<ul>
<li>I exported the list of regions from the database:</li>
</ul>
</li>
</ul>
<pre><code>dspace=&gt; \COPY (SELECT DISTINCT(text_value) as &quot;cg.coverage.region&quot; FROM metadatavalue WHERE resource_type_id=2 AND metadata_field_id=227) TO /tmp/2020-10-14-regions.csv WITH CSV HEADER;
COPY 34
</code></pre><ul>
<li>I repeated the country workflow for the regions: OpenRefine for the database values and vim for the input forms</li>
<li>After testing the replacements locally I ran them on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i /tmp/2020-10-13-CGSpace-countries.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.country -t 'correct' -m 228
$ ./fix-metadata-values.py -i /tmp/2020-10-14-CGSpace-regions.csv -db dspace -u dspace -p 'fuuu' -f cg.coverage.region -t 'correct' -m 227
</code></pre><ul>
<li>Then I started a full re-indexing:</li>
</ul>
<pre><code>$ time chrt -b 0 ionice -c2 -n7 nice -n19 dspace index-discovery -b
real 88m21.678s
user 7m59.182s
sys 2m22.713s
</code></pre><ul>
<li>I added a dozen or so more mappings to fix some country outliers on AReS
<ul>
<li>I will start a fresh harvest there once the Discovery update is done on CGSpace</li>
</ul>
</li>
<li>I also adjusted my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts to work on DSpace 6, where the <code>resource_type_id</code> field no longer exists
<ul>
<li>I will need to do it on a few more scripts as well, but I&rsquo;ll do that after we migrate to DSpace 6 because those scripts are less important</li>
</ul>
</li>
<li>I found a new setting in DSpace 6&rsquo;s <code>usage-statistics.cfg</code> for case-insensitive matching of bots that defaults to false, so I enabled it in our DSpace 6 branch
<ul>
<li>I am curious to see if that resolves the strange issue I noticed yesterday, where matching against the patterns in the spider agents file was not working at all</li>
</ul>
</li>
</ul>
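<p>To illustrate why case sensitivity could matter for bot matching, here is a small Python sketch. It is not DSpace&rsquo;s implementation: the <code>is_bot</code> helper and the user agent string are hypothetical, and only the two patterns are borrowed from the purge log earlier in these notes.</p>

```python
import re

# "Vizzit" and "Scoop\.it" appear in the purge log earlier in these notes.
# Everything else in this sketch is an assumption for illustration.
AGENT_PATTERNS = [r"Vizzit", r"Scoop\.it"]


def is_bot(user_agent, case_insensitive=False):
    """Return True if any pattern is found in the user agent string."""
    flags = re.IGNORECASE if case_insensitive else 0
    return any(re.search(p, user_agent, flags) for p in AGENT_PATTERNS)


if __name__ == "__main__":
    # Hypothetical user agent that differs from the pattern only by case.
    agent = "Mozilla/5.0 (compatible; vizzit)"
    print(is_bot(agent))
    print(is_bot(agent, case_insensitive=True))
```

<p>With case-sensitive matching the lowercase agent slips through, while the case-insensitive call flags it.</p>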
<!-- raw HTML omitted -->