mirror of https://github.com/alanorth/cgspace-notes.git, synced 2024-11-26 00:18:21 +01:00
Add notes for 2018-01-15
parent c762de7743
commit 2aed7e9d4c
@ -593,3 +593,94 @@ Caused by: java.lang.NullPointerException
- Also, the fallback connection parameters specified in local.cfg (not dspace.cfg) are used
- Shit, this might actually be a DSpace error: https://jira.duraspace.org/browse/DS-3434
- I'll comment on that issue

## 2018-01-14

- Looking at the authors Peter had corrected
- Some had multiple authors and he's corrected them by adding `||` in the correction column, but I can't process those this way, so I will just have to flag them and do those manually later
- Also, I can flag the values that have "DELETE"
- Then I need to facet the correction column on isBlank(value) and not flagged
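
- A minimal Python sketch of that flag-then-facet logic, assuming the corrections are exported as rows with a `correct` column (the `text_value` column name and all sample rows are made up for illustration, they are not Peter's data):

```python
# Hypothetical rows as they might come out of the OpenRefine export;
# the names and the "text_value" column are invented for illustration.
rows = [
    {"text_value": "Example, A.", "correct": "Example, Alice"},
    {"text_value": "Example, A. and Sample, B.", "correct": "Example, Alice||Sample, Bob"},
    {"text_value": "Unknown", "correct": "DELETE"},
    {"text_value": "Sample, Bob", "correct": ""},
]


def is_flagged(row):
    """Flag multi-value corrections and deletions for manual handling."""
    correction = row["correct"]
    return "||" in correction or "DELETE" in correction


def to_apply(rows):
    """Keep rows with a non-blank correction that are not flagged."""
    return [r for r in rows if r["correct"].strip() and not is_flagged(r)]


flagged = [r["text_value"] for r in rows if is_flagged(r)]
ready = [r["text_value"] for r in to_apply(rows)]
print("flagged for manual work:", flagged)
print("ready to apply:", ready)
```

- In this toy data only the first row is ready to apply; the `||` and "DELETE" rows are set aside and the blank correction is skipped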

## 2018-01-15

- Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload
- I'm going to apply these ~130 corrections on CGSpace:

```
update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
```
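
- Note that `~` in PostgreSQL is an unanchored regular expression match, so a pattern like `(En|English)` also matches longer values that merely contain `En`. A quick sketch using Python's `re.search` as a stand-in (assuming these simple alternations behave the same way in both engines; the sample values are invented, not taken from CGSpace):

```python
import re

# Invented sample values for a language field; not real CGSpace data.
samples = ["En", "English", "en", "Endnote", "IN", "fr"]


def matched(pattern, values):
    """Return the values an unanchored regex search would match."""
    return [v for v in values if re.search(pattern, v)]


# Same alternation as in the SQL above, unanchored.
unanchored = matched(r"(En|English)", samples)

# Anchored variant that only matches the whole value.
anchored = matched(r"^(En|English)$", samples)

print("unanchored:", unanchored)
print("anchored:", anchored)
```

- The anchored form `^(En|English)$` would be the safer choice if the field could hold values that only contain one of the alternatives as a substring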

- Continue proofing Peter's author corrections that I started yesterday, faceting on non blank, non flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names

![OpenRefine Authors](/cgspace-notes/2018/01/openrefine-authors.png)

- Apply corrections using [fix-metadata-values.py](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897):

```
$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
```

- While looking at some of the values to delete or check, I found some metadata values whose handle I could not resolve via SQL:

```
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
 metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
           2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
(1 row)

dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
 handle
--------
(0 rows)
```

- Even searching in the DSpace advanced search for author equals "Tarawali" produces nothing...
- Otherwise, the [DSpace 5 SQL Helper Functions](https://wiki.duraspace.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5) provide `ds5_item2itemhandle()`, which is much easier than my long query above that I always have to go search for
- For example, to find the Handle for an item that has the author "Erni":

```
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
 metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
           2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
(1 row)
dspace=# select ds5_item2itemhandle(70308);
 ds5_item2itemhandle
---------------------
 10568/68609
(1 row)
```
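
- For cases like the "Tarawali" one, a `LEFT JOIN` would make it easier to tell "no such item" apart from "item exists but has no handle". A self-contained sketch using Python's `sqlite3` with a toy two-table schema (an assumption loosely modeled on the `item` and `handle` tables queried above; the rows are illustrative only and say nothing about what is really in the database):

```python
import sqlite3

# Toy in-memory schema; a simplified assumption, not the real DSpace schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (item_id INTEGER PRIMARY KEY);
    CREATE TABLE handle (handle TEXT, resource_type_id INTEGER, resource_id INTEGER);
    INSERT INTO item VALUES (70308), (4369);
    INSERT INTO handle VALUES ('10568/68609', 2, 70308);
    """
)


def item_handle(conn, item_id):
    """Return (item_id, handle or None), or None if the item row is missing."""
    return conn.execute(
        """
        SELECT item.item_id, handle.handle
        FROM item
        LEFT JOIN handle
          ON handle.resource_id = item.item_id
         AND handle.resource_type_id = 2
        WHERE item.item_id = ?
        """,
        (item_id,),
    ).fetchone()


print(item_handle(conn, 70308))  # item with a handle
print(item_handle(conn, 4369))   # item row exists, handle is NULL
print(item_handle(conn, 1))      # no item row at all
```

- With the inner join in my long query, the last two cases both come back as zero rows, so they can't be distinguished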

- Next I apply the author deletions:

```
$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
```

- Now working on the affiliation corrections from Peter:

```
$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
```

- Now I made a new list of affiliations for Peter to look through:

```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4552
```

- Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)
- For example, this one is from just last month: https://cgspace.cgiar.org/handle/10568/89930
- Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture
- So some submitters don't know to use the controlled vocabulary lookup
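
- A rough way to list cleanup candidates would be to strip a trailing parenthesized abbreviation and check whether the remainder is in the controlled vocabulary. A sketch in Python, assuming a two-column, headerless CSV like the one `\copy ... with csv` writes; the vocabulary set, the counts, and the regular expression are illustrative assumptions:

```python
import csv
import io
import re

# Illustrative stand-in for the controlled vocabulary.
controlled = {"International Center for Tropical Agriculture"}

# Stand-in for the affiliations CSV; the counts are made up.
csv_text = (
    '"International Center for Tropical Agriculture (CIAT)",12\n'
    "International Center for Tropical Agriculture,340\n"
    "Some Other Institute,3\n"
)

# Trailing " (ABBR)" suffix made of capital letters only.
suffix = re.compile(r"\s*\([A-Z]+\)$")


def candidates(rows, vocabulary):
    """Return (value, suggested, count) for values that only differ by a suffix."""
    found = []
    for value, count in rows:
        stripped = suffix.sub("", value)
        if stripped != value and stripped in vocabulary:
            found.append((value, stripped, int(count)))
    return found


rows = list(csv.reader(io.StringIO(csv_text)))
fixes = candidates(rows, controlled)
for value, suggested, count in fixes:
    print(f"{count}: {value!r} -> {suggested!r}")
```

- The resulting pairs could then be reviewed by hand before being turned into a corrections CSV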

@ -92,7 +92,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv

<meta property="article:published_time" content="2018-01-02T08:35:54-08:00"/>

<meta property="article:modified_time" content="2018-01-12T07:55:01+02:00"/>
<meta property="article:modified_time" content="2018-01-13T18:04:45+02:00"/>

@ -194,9 +194,9 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
"@type": "BlogPosting",
"headline": "January, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-01/",
"wordCount": "2965",
"wordCount": "3592",
"datePublished": "2018-01-02T08:35:54-08:00",
"dateModified": "2018-01-12T07:55:01+02:00",
"dateModified": "2018-01-13T18:04:45+02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -911,6 +911,110 @@ Caused by: java.lang.NullPointerException
<li>I’ll comment on that issue</li>
</ul>

<h2 id="2018-01-14">2018-01-14</h2>

<ul>
<li>Looking at the authors Peter had corrected</li>
<li>Some had multiple authors and he’s corrected them by adding <code>||</code> in the correction column, but I can’t process those this way, so I will just have to flag them and do those manually later</li>
<li>Also, I can flag the values that have “DELETE”</li>
<li>Then I need to facet the correction column on isBlank(value) and not flagged</li>
</ul>

<h2 id="2018-01-15">2018-01-15</h2>

<ul>
<li>Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload</li>
<li>I’m going to apply these ~130 corrections on CGSpace:</li>
</ul>

<pre><code>update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
</code></pre>

<ul>
<li>Continue proofing Peter’s author corrections that I started yesterday, faceting on non blank, non flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names</li>
</ul>

<p><img src="/cgspace-notes/2018/01/openrefine-authors.png" alt="OpenRefine Authors" /></p>

<ul>
<li>Apply corrections using <a href="https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897">fix-metadata-values.py</a>:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
</code></pre>

<ul>
<li>While looking at some of the values to delete or check, I found some metadata values whose handle I could not resolve via SQL:</li>
</ul>

<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
 metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
           2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
(1 row)

dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
 handle
--------
(0 rows)
</code></pre>

<ul>
<li>Even searching in the DSpace advanced search for author equals “Tarawali” produces nothing…</li>
<li>Otherwise, the <a href="https://wiki.duraspace.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5">DSpace 5 SQL Helper Functions</a> provide <code>ds5_item2itemhandle()</code>, which is much easier than my long query above that I always have to go search for</li>
<li>For example, to find the Handle for an item that has the author “Erni”:</li>
</ul>

<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
 metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
           2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
(1 row)
dspace=# select ds5_item2itemhandle(70308);
 ds5_item2itemhandle
---------------------
 10568/68609
(1 row)
</code></pre>

<ul>
<li>Next I apply the author deletions:</li>
</ul>

<pre><code>$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
</code></pre>

<ul>
<li>Now working on the affiliation corrections from Peter:</li>
</ul>

<pre><code>$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
</code></pre>

<ul>
<li>Now I made a new list of affiliations for Peter to look through:</li>
</ul>

<pre><code>dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4552
</code></pre>

<ul>
<li>Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)</li>
<li>For example, this one is from just last month: <a href="https://cgspace.cgiar.org/handle/10568/89930">https://cgspace.cgiar.org/handle/10568/89930</a></li>
<li>Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture</li>
<li>So some submitters don’t know to use the controlled vocabulary lookup</li>
</ul>

BIN
public/2018/01/openrefine-authors.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 108 KiB |
@ -4,7 +4,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-01/</loc>
<lastmod>2018-01-12T07:55:01+02:00</lastmod>
<lastmod>2018-01-13T18:04:45+02:00</lastmod>
</url>

<url>
@ -144,7 +144,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2018-01-12T07:55:01+02:00</lastmod>
<lastmod>2018-01-13T18:04:45+02:00</lastmod>
<priority>0</priority>
</url>

@ -155,7 +155,7 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-01-12T07:55:01+02:00</lastmod>
<lastmod>2018-01-13T18:04:45+02:00</lastmod>
<priority>0</priority>
</url>

@ -167,13 +167,13 @@

<url>
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
<lastmod>2018-01-12T07:55:01+02:00</lastmod>
<lastmod>2018-01-13T18:04:45+02:00</lastmod>
<priority>0</priority>
</url>

<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2018-01-12T07:55:01+02:00</lastmod>
<lastmod>2018-01-13T18:04:45+02:00</lastmod>
<priority>0</priority>
</url>

BIN
static/2018/01/openrefine-authors.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 108 KiB |