Mirror of https://github.com/alanorth/cgspace-notes.git
Synced 2024-11-22 14:45:03 +01:00

Update

Parent: 0bd871a13a
Commit: 79c025af88
@@ -56,3 +56,65 @@ UPDATE 1
- Help Sisay proof 200 IITA records on DSpace Test
- Finally import Udana's 24 items to [IWMI Journal Articles](https://cgspace.cgiar.org/handle/10568/36185) on CGSpace
- Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc

## 2018-03-08

- Looking at a CSV dump of the CIAT community I see there are tons of stupid text languages people add for their metadata
- This makes the CSV have tons of columns, for example `dc.title`, `dc.title[]`, `dc.title[en]`, `dc.title[eng]`, `dc.title[en_US]` and so on!
- I think I can fix, or at least normalize, them in the database:

```
dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------

ethnob
en
spa
EN
En
en_
en_US
E.

EN_US
en_U
eng
fr
es_ES
es
(16 rows)

dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
UPDATE 122227
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------

ethnob
en_US
spa
E.

fr
es_ES
es
(9 rows)
```
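
The `UPDATE` above folds seven English variants into `en_US`. A minimal Python sketch of the same mapping, which might be handy for checking a CSV export before touching the database. The name `normalize_text_lang` is hypothetical, the variant list is copied from the `IN (...)` clause, and reading the two blank rows in the query output as an empty string and a NULL is an assumption:

```python
# Variants folded into en_US, copied from the IN (...) list of the UPDATE.
ENGLISH_VARIANTS = {"en", "EN", "En", "en_", "EN_US", "en_U", "eng"}


def normalize_text_lang(text_lang):
    """Return 'en_US' for the known English variants, else the value unchanged."""
    if text_lang in ENGLISH_VARIANTS:
        return "en_US"
    return text_lang


# The 16 distinct values from the first SELECT. The "" and None entries are
# an assumption about what the two blank rows in the psql output represent.
before = ["ethnob", "en", "spa", "EN", "En", "en_", "en_US", "E.",
          "EN_US", "en_U", "eng", "fr", "es_ES", "es", "", None]

distinct_after = {normalize_text_lang(v) for v in before}
print(len(distinct_after))  # 9, the same count as the second SELECT
```

Values outside the list, such as `E.` and `spa`, are deliberately left alone here, as they were in the SQL.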

- In other news, I was playing with adding ORCID identifiers to a dump of CIAT's community via CSV in OpenRefine
- Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the `cg.creator.id` field
- For example, a GREL expression in a custom text facet to get all items with `dc.contributor.author[en_US]` of a certain author with several name variations (this is how you use a logical OR in OpenRefine):

```
or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
```
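
For comparison outside OpenRefine, the facet above is roughly equivalent to the Python predicate below. This is only a sketch: `matches_author` is a hypothetical name, and the sample cell values are invented for illustration.

```python
def matches_author(value):
    """True when the cell contains either name variant (a logical OR)."""
    return "Ceballos, Hern" in value or "Hernández Ceballos" in value


# Invented sample cells; multiple authors are joined with "||" as in the CSV.
cells = [
    "Ceballos, Hernan||Doe, Jane",
    "Hernández Ceballos, H.",
    "Doe, Jane",
]
matching = [cell for cell in cells if matches_author(cell)]
```

Like `value.contains()` in GREL, the `in` operator is a case-sensitive substring test, so the prefix `Ceballos, Hern` covers both accented and unaccented spellings of the first name.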

- Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:

```
if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
```
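
The same set-or-append logic, sketched in Python to make the two branches explicit. `add_creator_id` is a hypothetical helper, and treating `None` and whitespace-only cells as blank is an assumption about how `isBlank()` behaves:

```python
ORCID_ENTRY = "Hernan Ceballos: 0000-0002-8744-7918"


def add_creator_id(value, entry=ORCID_ENTRY):
    """Set the entry on a blank cell, otherwise append it after a '||' separator."""
    if value is None or value.strip() == "":
        return entry
    return value + "||" + entry
```

As with the GREL expression, nothing here checks whether the entry is already present, so applying it twice to the same cell would duplicate the identifier.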

- One thing that bothers me is that this won't honor author order
- It might be better to do batches of these in PostgreSQL with a script that takes the `place` column of an author into account when setting the `cg.creator.id`
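
One possible shape for such a script, sketched in plain Python over rows already fetched from the database rather than as real DSpace queries. Everything here is an assumption for illustration: the function name, the `(place, name)` row layout, the lookup table, and the second author with a placeholder identifier.

```python
def creator_ids_in_author_order(authors, orcids):
    """Return ORCID entries ordered by the matching author's place.

    authors: iterable of (place, author_name) rows for one item (hypothetical layout)
    orcids:  dict mapping author_name to a "Name: ORCID" string
    """
    ordered = []
    for place, name in sorted(authors, key=lambda row: row[0]):
        entry = orcids.get(name)
        if entry is not None:  # authors without a known ORCID are skipped
            ordered.append(entry)
    return ordered


# Invented rows for a single item; the second identifier is a placeholder.
authors = [(2, "Ceballos, Hernan"), (1, "Doe, Jane"), (3, "Roe, Richard")]
orcids = {
    "Ceballos, Hernan": "Hernan Ceballos: 0000-0002-8744-7918",
    "Doe, Jane": "Jane Doe: 0000-0000-0000-0000",
}
creator_ids = creator_ids_in_author_order(authors, orcids)
```

The sketch only preserves relative order. Whether the resulting `cg.creator.id` values should reuse the author's `place` or be renumbered is left open.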

@@ -20,7 +20,7 @@ Export a CSV of the IITA community metadata for Martin Mueller

<meta property="article:published_time" content="2018-03-02T16:07:54+02:00"/>
<meta property="article:modified_time" content="2018-03-07T12:29:24+02:00"/>
<meta property="article:modified_time" content="2018-03-08T15:05:29+02:00"/>

@@ -51,9 +51,9 @@ Export a CSV of the IITA community metadata for Martin Mueller
"@type": "BlogPosting",
"headline": "March, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-03/",
"wordCount": "323",
"wordCount": "599",
"datePublished": "2018-03-02T16:07:54+02:00",
"dateModified": "2018-03-07T12:29:24+02:00",
"dateModified": "2018-03-08T15:05:29+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@@ -182,6 +182,73 @@ UPDATE 1
<li>Skype with James Stapleton to discuss CGSpace, ILRI website, CKM staff issues, etc</li>
</ul>

<h2 id="2018-03-08">2018-03-08</h2>

<ul>
<li>Looking at a CSV dump of the CIAT community I see there are tons of stupid text languages people add for their metadata</li>
<li>This makes the CSV have tons of columns, for example <code>dc.title</code>, <code>dc.title[]</code>, <code>dc.title[en]</code>, <code>dc.title[eng]</code>, <code>dc.title[en_US]</code> and so on!</li>
<li>I think I can fix, or at least normalize, them in the database:</li>
</ul>

<pre><code>dspace=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------

ethnob
en
spa
EN
En
en_
en_US
E.

EN_US
en_U
eng
fr
es_ES
es
(16 rows)

dspace=# update metadatavalue set text_lang='en_US' where resource_type_id=2 and text_lang in ('en','EN','En','en_','EN_US','en_U','eng');
UPDATE 122227
dspacetest=# select distinct text_lang from metadatavalue where resource_type_id=2;
text_lang
-----------

ethnob
en_US
spa
E.

fr
es_ES
es
(9 rows)
</code></pre>

<ul>
<li>In other news, I was playing with adding ORCID identifiers to a dump of CIAT’s community via CSV in OpenRefine</li>
<li>Using a series of filters, flags, and GREL expressions to isolate items for a certain author, I figured out how to add ORCID identifiers to the <code>cg.creator.id</code> field</li>
<li>For example, a GREL expression in a custom text facet to get all items with <code>dc.contributor.author[en_US]</code> of a certain author with several name variations (this is how you use a logical OR in OpenRefine):</li>
</ul>

<pre><code>or(value.contains('Ceballos, Hern'), value.contains('Hernández Ceballos'))
</code></pre>

<ul>
<li>Then you can flag or star matching items and then use a conditional to either set the value directly or add it to an existing value:</li>
</ul>

<pre><code>if(isBlank(value), "Hernan Ceballos: 0000-0002-8744-7918", value + "||Hernan Ceballos: 0000-0002-8744-7918")
</code></pre>

<ul>
<li>One thing that bothers me is that this won’t honor author order</li>
<li>It might be better to do batches of these in PostgreSQL with a script that takes the <code>place</code> column of an author into account when setting the <code>cg.creator.id</code></li>
</ul>

@@ -4,7 +4,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-03/</loc>
<lastmod>2018-03-07T12:29:24+02:00</lastmod>
<lastmod>2018-03-08T15:05:29+02:00</lastmod>
</url>

<url>
@@ -154,7 +154,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2018-03-07T12:29:24+02:00</lastmod>
<lastmod>2018-03-08T15:05:29+02:00</lastmod>
<priority>0</priority>
</url>

@@ -165,7 +165,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-03-07T12:29:24+02:00</lastmod>
<lastmod>2018-03-08T15:05:29+02:00</lastmod>
<priority>0</priority>
</url>

@@ -177,13 +177,13 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
<lastmod>2018-03-07T12:29:24+02:00</lastmod>
<lastmod>2018-03-08T15:05:29+02:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2018-03-07T12:29:24+02:00</lastmod>
<lastmod>2018-03-08T15:05:29+02:00</lastmod>
<priority>0</priority>
</url>