Add notes for 2018-01-15

2018-01-15 12:18:26 +02:00
parent c762de7743
commit 2aed7e9d4c
5 changed files with 203 additions and 8 deletions

@@ -593,3 +593,94 @@ Caused by: java.lang.NullPointerException
- Also, the fallback connection parameters specified in local.cfg (not dspace.cfg) are used
- Shit, this might actually be a DSpace error: https://jira.duraspace.org/browse/DS-3434
- I'll comment on that issue
## 2018-01-14
- Looking at the authors Peter had corrected
- Some values contained multiple authors and he corrected them by adding `||` in the correction column, but I can't process those this way, so I will have to flag them and fix them manually later
- Also, I can flag the values that have "DELETE"
- Then I need to facet the correction column on isBlank(value) and not flagged
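The triage above can be sketched in Python for cases where the corrections live in a plain CSV rather than OpenRefine. This is only an illustration: the column names `dc.contributor.author` and `correct`, the helper `triage_corrections`, and the sample rows are assumptions, not the actual export.

```python
import csv
import io


def triage_corrections(rows, fix_col="correct"):
    """Split rows into (to_apply, to_flag, to_delete) based on the correction column."""
    to_apply, to_flag, to_delete = [], [], []
    for row in rows:
        fix = (row.get(fix_col) or "").strip()
        if not fix:
            continue  # blank correction: nothing to do for this value
        if "||" in fix:
            to_flag.append(row)  # multiple authors: handle manually later
        elif fix == "DELETE":
            to_delete.append(row)
        else:
            to_apply.append(row)
    return to_apply, to_flag, to_delete


# Made-up sample data standing in for the real corrections file
sample = io.StringIO(
    "dc.contributor.author,correct\n"
    '"Doe, J.","Doe, Jane"\n'
    '"Doe, J. and Roe, R.","Doe, Jane||Roe, Richard"\n'
    '"Test, Author",DELETE\n'
    '"Poe, Pat",\n'
)

to_apply, to_flag, to_delete = triage_corrections(csv.DictReader(sample))
print(len(to_apply), len(to_flag), len(to_delete))
```

Only the rows in `to_apply` would be safe to feed to a batch correction step; the flagged rows still need a human decision.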
## 2018-01-15
- Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload
- I'm going to apply these ~130 corrections on CGSpace:
```
update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
```
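One thing to keep in mind when reusing statements like these: PostgreSQL's `~` operator matches the pattern anywhere in the value, so an unanchored pattern such as `(En|English)` also matches longer strings that merely contain `En`. The sketch below uses Python's `re` module as a stand-in for the SQL operator (for patterns this simple the behavior should be the same), and the sample values are made up.

```python
import re

# Same pattern as in the SQL above, without and with anchors
unanchored = re.compile(r"(En|English)")
anchored = re.compile(r"^(En|English)$")

# Hypothetical language values, not taken from the real database
samples = ["En", "English", "Entry in French", "en"]

unanchored_hits = [s for s in samples if unanchored.search(s)]
anchored_hits = [s for s in samples if anchored.search(s)]

print(unanchored_hits)
print(anchored_hits)
```

Anchoring with `^` and `$` restricts the update to values that consist of the pattern and nothing else, which is probably what is wanted when normalizing to ISO codes.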
- Continue proofing Peter's author corrections that I started yesterday, faceting on non-blank, non-flagged values and briefly scrolling through the corrections to find encoding errors in French and Spanish names
![OpenRefine Authors](/cgspace-notes/2018/01/openrefine-authors.png)
- Apply corrections using [fix-metadata-values.py](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897):
```
$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
```
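For readers without the gist at hand, the general idea of applying corrections from a CSV can be sketched as below. This is not the code of fix-metadata-values.py, whose internals are not shown in these notes; it is a guess at the approach, run against an in-memory SQLite table that is a simplified, hypothetical stand-in for `metadatavalue`, with made-up rows.

```python
import csv
import io
import sqlite3

# Simplified stand-in for the DSpace metadatavalue table (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadatavalue ("
    "resource_type_id INTEGER, metadata_field_id INTEGER, text_value TEXT)"
)
conn.executemany(
    "INSERT INTO metadatavalue VALUES (2, 3, ?)",
    [("Doe, J.",), ("Doe, J.",), ("Roe, Richard",)],
)

# Made-up corrections: original value in one column, fix in the other
corrections = io.StringIO(
    "dc.contributor.author,correct\n"
    '"Doe, J.","Doe, Jane"\n'
)

updated = 0
for row in csv.DictReader(corrections):
    # Parameterized query, so quotes in names cannot break the SQL
    cur = conn.execute(
        "UPDATE metadatavalue SET text_value = ? "
        "WHERE resource_type_id = 2 AND metadata_field_id = 3 AND text_value = ?",
        (row["correct"], row["dc.contributor.author"]),
    )
    updated += cur.rowcount
conn.commit()

values = [r[0] for r in conn.execute(
    "SELECT text_value FROM metadatavalue ORDER BY rowid"
)]
print(updated, values)
```

Matching on the exact original value keeps each correction scoped to one field and one string, so an unrelated value is never touched.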
- While looking at some of the values to delete or check, I found some metadata values whose handles I could not resolve via SQL:
```
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
 metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
           2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
(1 row)

dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
 handle
--------
(0 rows)
```
- Even searching in the DSpace advanced search for author equals "Tarawali" produces nothing...
- Otherwise, the [DSpace 5 SQL Helper Functions](https://wiki.duraspace.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5) provide `ds5_item2itemhandle()`, which is much easier than my long query above that I always have to go search for
- For example, to find the Handle for an item that has the author "Erni":
```
dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
 metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id
-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
           2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
(1 row)

dspace=# select ds5_item2itemhandle(70308);
 ds5_item2itemhandle
---------------------
 10568/68609
(1 row)
```
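Cases like the unresolvable "Tarawali" value could also be found in bulk with an anti-join instead of one lookup at a time. The sketch below only illustrates the query shape: it uses an in-memory SQLite database with heavily simplified, assumed versions of the `metadatavalue` and `handle` tables, seeded with the two rows from the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for the DSpace 5 tables (assumption)
    CREATE TABLE metadatavalue (
        resource_id INTEGER, resource_type_id INTEGER,
        metadata_field_id INTEGER, text_value TEXT);
    CREATE TABLE handle (
        handle TEXT, resource_type_id INTEGER, resource_id INTEGER);
    INSERT INTO metadatavalue VALUES (4369, 2, 3, 'Tarawali');
    INSERT INTO metadatavalue VALUES (70308, 2, 3, 'Erni');
    INSERT INTO handle VALUES ('10568/68609', 2, 70308);
    """
)

# Author values on items (resource_type_id = 2) that have no handle row
orphans = conn.execute(
    """
    SELECT m.resource_id, m.text_value
    FROM metadatavalue m
    LEFT JOIN handle h
      ON h.resource_id = m.resource_id
     AND h.resource_type_id = m.resource_type_id
    WHERE m.resource_type_id = 2
      AND m.metadata_field_id = 3
      AND h.handle IS NULL
    """
).fetchall()
print(orphans)
```

Joining on the resource type as well as the resource ID avoids accidentally matching a community or collection that happens to share the same numeric ID.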
- Next I apply the author deletions:
```
$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
```
- Now working on the affiliation corrections from Peter:
```
$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
```
- Then I made a new list of affiliations for Peter to look through:
```
dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
COPY 4552
```
- Looking over the affiliations again, I see dozens of CIAT ones formatted with the abbreviation appended, like: International Center for Tropical Agriculture (CIAT)
- For example, this one is from just last month: https://cgspace.cgiar.org/handle/10568/89930
- Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture
- So some submitters don't know to use the controlled vocabulary lookup
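Values like these could be detected automatically by stripping a trailing abbreviation in parentheses and checking whether the remainder is in the controlled vocabulary. The following is a minimal sketch of that idea; the helper `suggest_controlled_form`, the regular expression, and the one-entry vocabulary are all assumptions for illustration, and the real vocabulary would have to be loaded from wherever it is maintained.

```python
import re

# One-entry stand-in for the real controlled vocabulary (assumption)
controlled_vocabulary = {"International Center for Tropical Agriculture"}

# A trailing "(ABBR)" such as "(CIAT)" at the end of the value
TRAILING_ABBREVIATION = re.compile(r"\s*\(([A-Z][A-Za-z&-]*)\)\s*$")


def suggest_controlled_form(value, vocabulary):
    """Return the vocabulary entry that matches value once a trailing
    abbreviation in parentheses is removed, or None if there is no match."""
    stripped = TRAILING_ABBREVIATION.sub("", value).strip()
    if stripped != value.strip() and stripped in vocabulary:
        return stripped
    return None


suggestion = suggest_controlled_form(
    "International Center for Tropical Agriculture (CIAT)", controlled_vocabulary
)
already_fine = suggest_controlled_form(
    "International Center for Tropical Agriculture", controlled_vocabulary
)
print(suggestion)
print(already_fine)
```

The suggestions could be written out as a corrections CSV for review rather than applied directly, since some parenthetical suffixes may be a legitimate part of an organization's name.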