mirror of https://github.com/alanorth/cgspace-notes.git, synced 2024-11-22 14:45:03 +01:00
Add notes for 2018-03-25
parent e95f2c2f49
commit c070fda9b3
@ -445,3 +445,48 @@ isNotNull(value.match(/.*\ufffd.*/))
- More work on the Ubuntu 18.04 readiness stuff for the [Ansible playbooks](https://github.com/ilri/rmg-ansible-public)
- The playbook now uses the system's Ruby and Node.js so I don't have to manually install RVM and NVM after

## 2018-03-25

- Looking at Peter's author corrections and trying to work out a way to find errors in OpenRefine easily
- I can find all names that have acceptable characters using a GREL expression like:

```
isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
```
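
- Note that the expression above matches any name containing at least one acceptable character. A stricter check, where every character must be acceptable, could be sketched in Python as below. This is only an illustration: the `is_acceptable` helper is hypothetical, and the extra separators (space, comma, period, apostrophe, hyphen) are an assumption about what author names may contain:

```python
import re

# Letters from the GREL character class, plus separators that commonly appear
# in "Surname, Given" style names (the separators are an assumption).
ACCEPTABLE_NAME = re.compile(r"[a-zA-ZáÁéèïíñØøöóúü ,.'-]+")


def is_acceptable(name: str) -> bool:
    """Return True only if every character in the name is acceptable."""
    return ACCEPTABLE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the real CSV
    for sample in ["Orth, Alan", "Orth, Alan (ILRI)"]:
        print(repr(sample), is_acceptable(sample))
```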

- But it's probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):

```
or(
  isNotNull(value.match(/.*[(|)].*/)),
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/))
)
```
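
- For checking a CSV outside OpenRefine, a rough Python equivalent of the illegal-character facet might look like the sketch below. The `has_invalid_chars` helper and the sample names are made up for illustration, and it uses Python's `re` module rather than GREL's regex engine:

```python
import re

# Characters treated as invalid, mirroring the GREL facet above:
# parentheses, pipe, U+FFFD (replacement character),
# U+00A0 (no-break space) and U+200A (hair space).
INVALID_CHARS = re.compile("[(|)\ufffd\u00a0\u200a]")


def has_invalid_chars(name: str) -> bool:
    """Return True if the name contains any character known to be invalid."""
    return INVALID_CHARS.search(name) is not None


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the real CSV
    samples = ["Orth, Alan", "Orth, Alan (ILRI)", "Orth,\u00a0Alan", "Orth|Alan"]
    for sample in samples:
        print(repr(sample), has_invalid_chars(sample))
```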

- And here's one combined GREL expression to check for items marked to delete or check so I can flag them and export them to a separate CSV (though perhaps it's time to add delete support to my `fix-metadata-values.py` script):

```
or(
  isNotNull(value.match(/.*delete.*/i)),
  isNotNull(value.match(/.*remove.*/i)),
  isNotNull(value.match(/.*check.*/i))
)
```
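
- The same delete/remove/check flagging could be approximated in Python to split values into two lists before export. This is a sketch only; `is_flagged`, `split_flagged`, and the sample values are hypothetical and not part of any existing script:

```python
import re

# Case-insensitive match for the same three keywords as the GREL facet
FLAG_PATTERN = re.compile(r"delete|remove|check", re.IGNORECASE)


def is_flagged(value: str) -> bool:
    """Return True if the value mentions delete, remove, or check."""
    return FLAG_PATTERN.search(value) is not None


def split_flagged(values):
    """Split values into (flagged, clean) lists, preserving order."""
    flagged = [v for v in values if is_flagged(v)]
    clean = [v for v in values if not is_flagged(v)]
    return flagged, clean


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the real CSV
    flagged, clean = split_flagged(["Smith, J. DELETE", "Doe, Jane"])
    print("flagged:", flagged)
    print("clean:", clean)
```

  Like the GREL version, this is a plain substring match, so a surname that happens to contain one of the keywords (a hypothetical "Checkley", say) would be flagged too and would need a manual look.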

- So I guess the routine in OpenRefine is:
  - Transform: trim leading/trailing whitespace
  - Transform: collapse consecutive whitespace
  - Custom text facet for items to delete/check
  - Custom text facet for illegal characters
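
- The two whitespace transforms in that routine could be approximated in Python roughly as follows. The `clean_whitespace` helper is a hypothetical name, and this is a sketch of the idea rather than what OpenRefine does internally:

```python
import re


def clean_whitespace(value: str) -> str:
    """Trim leading/trailing whitespace, then collapse internal runs to one space."""
    return re.sub(r"\s+", " ", value.strip())


if __name__ == "__main__":
    # Hypothetical sample value, not taken from the real CSV
    print(repr(clean_whitespace("  Orth,   Alan  ")))
```

  One caveat: in Python, `\s` also matches Unicode spaces such as U+00A0, so this sketch would silently normalize them instead of leaving them for the illegal-character facet. Whether OpenRefine's built-in transforms behave the same way is not something these notes establish.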

- Test the corrections and deletions locally, then run them on CGSpace:

```
$ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
```

- Afterwards I started a full Discovery reindexing

@ -20,7 +20,7 @@ Export a CSV of the IITA community metadata for Martin Mueller
<meta property="article:published_time" content="2018-03-02T16:07:54+02:00"/>
-<meta property="article:modified_time" content="2018-03-22T23:07:03+02:00"/>
+<meta property="article:modified_time" content="2018-03-24T22:03:00+02:00"/>

@ -51,9 +51,9 @@ Export a CSV of the IITA community metadata for Martin Mueller
"@type": "BlogPosting",
"headline": "March, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-03/",
-"wordCount": "2509",
+"wordCount": "2695",
"datePublished": "2018-03-02T16:07:54+02:00",
-"dateModified": "2018-03-22T23:07:03+02:00",
+"dateModified": "2018-03-24T22:03:00+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -626,6 +626,60 @@ sys 2m45.135s
<li>The playbook now uses the system’s Ruby and Node.js so I don’t have to manually install RVM and NVM after</li>
</ul>

<h2 id="2018-03-25">2018-03-25</h2>

<ul>
<li>Looking at Peter’s author corrections and trying to work out a way to find errors in OpenRefine easily</li>
<li>I can find all names that have acceptable characters using a GREL expression like:</li>
</ul>

<pre><code>isNotNull(value.match(/.*[a-zA-ZáÁéèïíñØøöóúü].*/))
</code></pre>

<ul>
<li>But it’s probably better to just say which characters I know for sure are not valid (like parentheses, pipe, or weird Unicode characters):</li>
</ul>

<pre><code>or(
  isNotNull(value.match(/.*[(|)].*/)),
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/))
)
</code></pre>

<ul>
<li>And here’s one combined GREL expression to check for items marked to delete or check so I can flag them and export them to a separate CSV (though perhaps it’s time to add delete support to my <code>fix-metadata-values.py</code> script):</li>
</ul>

<pre><code>or(
  isNotNull(value.match(/.*delete.*/i)),
  isNotNull(value.match(/.*remove.*/i)),
  isNotNull(value.match(/.*check.*/i))
)
</code></pre>

<ul>
<li><p>So I guess the routine in OpenRefine is:</p>

<ul>
<li>Transform: trim leading/trailing whitespace</li>
<li>Transform: collapse consecutive whitespace</li>
<li>Custom text facet for items to delete/check</li>
<li>Custom text facet for illegal characters</li>
</ul></li>

<li><p>Test the corrections and deletions locally, then run them on CGSpace:</p></li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i /tmp/Correct-2928-Authors-2018-03-21.csv -db dspace -u dspace -p 'fuuu' -f dc.contributor.author -t correct -m 3
$ ./delete-metadata-values.py -i /tmp/Delete-8-Authors-2018-03-21.csv -f dc.contributor.author -m 3 -db dspacetest -u dspace -p 'fuuu'
</code></pre>

<ul>
<li>Afterwards I started a full Discovery reindexing</li>
</ul>

@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-03/</loc>
-<lastmod>2018-03-22T23:07:03+02:00</lastmod>
+<lastmod>2018-03-24T22:03:00+02:00</lastmod>
</url>

<url>
@ -154,7 +154,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
-<lastmod>2018-03-22T23:07:03+02:00</lastmod>
+<lastmod>2018-03-24T22:03:00+02:00</lastmod>
<priority>0</priority>
</url>

@ -165,7 +165,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
-<lastmod>2018-03-22T23:07:03+02:00</lastmod>
+<lastmod>2018-03-24T22:03:00+02:00</lastmod>
<priority>0</priority>
</url>

@ -177,13 +177,13 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/posts/</loc>
-<lastmod>2018-03-22T23:07:03+02:00</lastmod>
+<lastmod>2018-03-24T22:03:00+02:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
-<lastmod>2018-03-22T23:07:03+02:00</lastmod>
+<lastmod>2018-03-24T22:03:00+02:00</lastmod>
<priority>0</priority>
</url>