diff --git a/content/post/2018-01.md b/content/post/2018-01.md
index 6ff7cfe4d..a0abb4aad 100644
--- a/content/post/2018-01.md
+++ b/content/post/2018-01.md
@@ -593,3 +593,94 @@ Caused by: java.lang.NullPointerException
 - Also, the fallback connection parameters specified in local.cfg (not dspace.cfg) are used
 - Shit, this might actually be a DSpace error: https://jira.duraspace.org/browse/DS-3434
 - I'll comment on that issue
+
+## 2018-01-14
+
+- Looking at the authors Peter had corrected
+- Some had multiple and he's corrected them by adding `||` in the correction column, but I can't process those this way so I will just have to flag them and do those manually later
+- Also, I can flag the values that have "DELETE"
+- Then I need to facet the correction column on isBlank(value) and not flagged
+
+## 2018-01-15
+
+- Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload
+- I'm going to apply these ~130 corrections on CGSpace:
+
+```
+update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
+delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
+update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
+update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
+update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
+update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
+update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
+update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
+delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
+```
+
+- Continue proofing Peter's author corrections that I started yesterday, faceting on non blank, non flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names
+
+![OpenRefine Authors](/cgspace-notes/2018/01/openrefine-authors.png)
+
+- Apply corrections using [fix-metadata-values.py](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897):
+
+```
+$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
+```
+
+- While looking at some of the values to delete or check I found some metadata values whose handles I could not resolve via SQL:
+
+```
+dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
+ metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
+-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
+           2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
+(1 row)
+
+dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
+ handle
+--------
+(0 rows)
+```
+
+- Even searching in the DSpace advanced search for author equals "Tarawali" produces nothing...
+- Otherwise, the [DSpace 5 SQL Helper Functions](https://wiki.duraspace.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5) provide `ds5_item2itemhandle()`, which is much easier than my long query above that I always have to go search for
+- For example, to find the Handle for an item that has the author "Erni":
+
+```
+dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
+ metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id
+-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
+           2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
+(1 row)
+dspace=# select ds5_item2itemhandle(70308);
+ ds5_item2itemhandle
+---------------------
+ 10568/68609
+(1 row)
+```
+
+- Next I apply the author deletions:
+
+```
+$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
+```
+
+- Now working on the affiliation corrections from Peter:
+
+```
+$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
+$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
+```
+
+- Now I made a new list of affiliations for Peter to look through:
+
+```
+dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
+COPY 4552
+```
+
+- Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)
+- For example, this one is from just last month: https://cgspace.cgiar.org/handle/10568/89930
+- Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture
+- So some submitters don't know to use the controlled vocabulary lookup
diff --git a/public/2018-01/index.html b/public/2018-01/index.html
index 856cfe8b4..0074e26ea 100644
--- a/public/2018-01/index.html
+++ b/public/2018-01/index.html
@@ -92,7 +92,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
-
+
@@ -194,9 +194,9 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
 "@type": "BlogPosting",
 "headline": "January, 2018",
 "url": "https://alanorth.github.io/cgspace-notes/2018-01/",
- "wordCount": "2965",
+ "wordCount": "3592",
 "datePublished": "2018-01-02T08:35:54-08:00",
- "dateModified": "2018-01-12T07:55:01+02:00",
+ "dateModified": "2018-01-13T18:04:45+02:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -911,6 +911,110 @@ Caused by: java.lang.NullPointerException
+Some had multiple and he’s corrected them by adding || in the correction column, but I can’t process those this way so I will just have to flag them and do those manually later
+update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
+delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
+update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
+update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
+update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
+update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
+update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
+update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
+delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
+
+
+$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
+
+
+dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
+ metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
+-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
+           2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
+(1 row)
+
+dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
+ handle
+--------
+(0 rows)
+
+
+Otherwise, the DSpace 5 SQL Helper Functions provide ds5_item2itemhandle(), which is much easier than my long query above that I always have to go search for
+dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
+ metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id
+-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
+           2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
+(1 row)
+dspace=# select ds5_item2itemhandle(70308);
+ ds5_item2itemhandle
+---------------------
+ 10568/68609
+(1 row)
+
+
+$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
+
+
+$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
+$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
+
+
+dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
+COPY 4552
+
+
+
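A note on the language cleanup statements in this diff: PostgreSQL's `~` operator matches a pattern anywhere in the value, so a pattern such as `(IN|In)` would also match any longer value that merely contains "In". The sketch below shows the same mapping with whole-string matching, in Python. It is only an illustration under stated assumptions: `LANGUAGE_FIXES` and `normalize_language()` are hypothetical names, are not part of fix-metadata-values.py, and the patterns are copied from the SQL statements rather than from any script in the repository.

```python
import re

# Hypothetical table mirroring the SQL updates in the diff: pattern -> replacement code.
LANGUAGE_FIXES = [
    (re.compile(r"En|English"), "en"),
    (re.compile(r"fre|frn|French"), "fr"),
    (re.compile(r"Spanish|spa"), "es"),
    (re.compile(r"Vietnamese"), "vi"),
    (re.compile(r"Ru"), "ru"),
    (re.compile(r"IN|In"), "in"),
]


def normalize_language(value):
    """Return the corrected language code, or the value unchanged if nothing matches."""
    for pattern, code in LANGUAGE_FIXES:
        # fullmatch() only succeeds when the entire string matches the pattern,
        # unlike an unanchored `~` match in PostgreSQL.
        if pattern.fullmatch(value):
            return code
    return value


if __name__ == "__main__":
    for sample in ["English", "frn", "Indonesian", "en"]:
        print(sample, "->", normalize_language(sample))
```

In SQL the equivalent would probably be to anchor each pattern, for example `text_value ~ '^(IN|In)$'`, and to review the matching rows with a `select` before running the `update`.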