diff --git a/content/post/2018-01.md b/content/post/2018-01.md
index 6ff7cfe4d..a0abb4aad 100644
--- a/content/post/2018-01.md
+++ b/content/post/2018-01.md
@@ -593,3 +593,94 @@ Caused by: java.lang.NullPointerException
 - Also, the fallback connection parameters specified in local.cfg (not dspace.cfg) are used
 - Shit, this might actually be a DSpace error: https://jira.duraspace.org/browse/DS-3434
 - I'll comment on that issue
+
+## 2018-01-14
+
+- Looking at the authors Peter had corrected
+- Some had multiple and he's corrected them by adding `||` in the correction column, but I can't process those this way so I will just have to flag them and do those manually later
+- Also, I can flag the values that have "DELETE"
+- Then I need to facet the correction column on isBlank(value) and not flagged
+
+## 2018-01-15
+
+- Help Udana from IWMI export a CSV from DSpace Test so he can start trying a batch upload
+- I'm going to apply these ~130 corrections on CGSpace:
+
+```
+update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
+delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
+update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
+update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
+update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
+update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
+update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
+update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
+delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
+```
+
+- Continue proofing Peter's author corrections that I started yesterday, faceting on non blank, non flagged, and briefly scrolling through the values of the corrections to find encoding errors for French and Spanish names
+
+![OpenRefine Authors](/cgspace-notes/2018/01/openrefine-authors.png)
+
+- Apply corrections using [fix-metadata-values.py](https://gist.github.com/alanorth/df92cbfb54d762ba21b28f7cd83b6897):
+
+```
+$ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
+```
+
+- In looking at some of the values to delete or check I found some metadata values that I could not resolve their handle via SQL:
+
+```
+dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
+ metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
+-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
+           2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
+(1 row)
+
+dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
+ handle
+--------
+(0 rows)
+```
+
+- Even searching in the DSpace advanced search for author equals "Tarawali" produces nothing...
+- Otherwise, the [DSpace 5 SQL Helper Functions](https://wiki.duraspace.org/display/DSPACE/Helper+SQL+functions+for+DSpace+5) provide `ds5_item2itemhandle()`, which is much easier than my long query above that I always have to go search for
+- For example, to find the Handle for an item that has the author "Erni":
+
+```
+dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
+ metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id
+-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
+           2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
+(1 row)
+dspace=# select ds5_item2itemhandle(70308);
+ ds5_item2itemhandle
+---------------------
+ 10568/68609
+(1 row)
+```
+
+- Next I apply the author deletions:
+
+```
+$ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
+```
+
+- Now working on the affiliation corrections from Peter:
+
+```
+$ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
+$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
+```
+
+- Now I made a new list of affiliations for Peter to look through:
+
+```
+dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
+COPY 4552
+```
+
+- Looking over the affiliations again I see dozens of CIAT ones with their affiliation formatted like: International Center for Tropical Agriculture (CIAT)
+- For example, this one is from just last month: https://cgspace.cgiar.org/handle/10568/89930
+- Our controlled vocabulary has this in the format without the abbreviation: International Center for Tropical Agriculture
+- So some submitters don't know to use the controlled vocabulary lookup
diff --git a/public/2018-01/index.html b/public/2018-01/index.html
index 856cfe8b4..0074e26ea 100644
--- a/public/2018-01/index.html
+++ b/public/2018-01/index.html
@@ -92,7 +92,7 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
-
+
@@ -194,9 +194,9 @@ Danny wrote to ask for help renewing the wildcard ilri.org certificate and I adv
 "@type": "BlogPosting",
 "headline": "January, 2018",
 "url": "https://alanorth.github.io/cgspace-notes/2018-01/",
- "wordCount": "2965",
+ "wordCount": "3592",
 "datePublished": "2018-01-02T08:35:54-08:00",
- "dateModified": "2018-01-12T07:55:01+02:00",
+ "dateModified": "2018-01-13T18:04:45+02:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -911,6 +911,110 @@ Caused by: java.lang.NullPointerException
  • I’ll comment on that issue
  • +

    2018-01-14

    + + + +

    2018-01-15

    + + + +
    update metadatavalue set text_value='Formally Published' where resource_type_id=2 and metadata_field_id=214 and text_value like 'Formally published';
    +delete from metadatavalue where resource_type_id=2 and metadata_field_id=214 and text_value like 'NO';
    +update metadatavalue set text_value='en' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(En|English)';
    +update metadatavalue set text_value='fr' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(fre|frn|French)';
    +update metadatavalue set text_value='es' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(Spanish|spa)';
    +update metadatavalue set text_value='vi' where resource_type_id=2 and metadata_field_id=38 and text_value='Vietnamese';
    +update metadatavalue set text_value='ru' where resource_type_id=2 and metadata_field_id=38 and text_value='Ru';
    +update metadatavalue set text_value='in' where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(IN|In)';
    +delete from metadatavalue where resource_type_id=2 and metadata_field_id=38 and text_value ~ '(dc.language.iso|CGIAR Challenge Program on Water and Food)';
    +
    + + + +

    OpenRefine Authors

    + + + +
    $ ./fix-metadata-values.py -i /tmp/2018-01-14-Authors-1300-Corrections.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
    +
    + + + +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Tarawali';
    + metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place | authority | confidence | resource_type_id
    +-------------------+-------------+-------------------+------------+-----------+-------+-----------+------------+------------------
    +           2757936 |        4369 |                 3 | Tarawali   |           |     9 |           |        600 |                2
    +(1 row)
    +
    +dspace=# select handle from item, handle where handle.resource_id = item.item_id AND item.item_id = '4369';
    + handle
    +--------
    +(0 rows)
    +
    + + + +
    dspace=# select * from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value='Erni';
    + metadata_value_id | resource_id | metadata_field_id | text_value | text_lang | place |              authority               | confidence | resource_type_id 
    +-------------------+-------------+-------------------+------------+-----------+-------+--------------------------------------+------------+------------------
    +           2612150 |       70308 |                 3 | Erni       |           |     9 | 3fe10c68-6773-49a7-89cc-63eb508723f2 |         -1 |                2
    +(1 row)
    +dspace=# select ds5_item2itemhandle(70308);
    + ds5_item2itemhandle 
    +---------------------
    + 10568/68609
    +(1 row)
    +
    + + + +
    $ ./delete-metadata-values.py -i /tmp/2018-01-14-Authors-5-Deletions.csv -f dc.contributor.author -m 3 -d dspace -u dspace -p 'fuuu'
    +
    + + + +
    $ ./fix-metadata-values.py -i /tmp/2018-01-15-Affiliations-888-Corrections.csv -f cg.contributor.affiliation -t correct -m 211 -d dspace -u dspace -p 'fuuu'
    +$ ./delete-metadata-values.py -i /tmp/2018-01-15-Affiliations-11-Deletions.csv -f cg.contributor.affiliation -m 211 -d dspace -u dspace -p 'fuuu'
    +
    + + + +
    dspace=# \copy (select distinct text_value, count(*) as count from metadatavalue where metadata_field_id = (select metadata_field_id from metadatafieldregistry where metadata_schema_id = 2 and element = 'contributor' and qualifier = 'affiliation') AND resource_type_id = 2 group by text_value order by count desc) to /tmp/affiliations.csv with csv;
    +COPY 4552
    +
    + + +
diff --git a/public/2018/01/openrefine-authors.png b/public/2018/01/openrefine-authors.png
new file mode 100644
index 000000000..bc94a8487
Binary files /dev/null and b/public/2018/01/openrefine-authors.png differ
diff --git a/public/sitemap.xml b/public/sitemap.xml
index 3f3f0797f..26e751854 100644
--- a/public/sitemap.xml
+++ b/public/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2018-01/
- 2018-01-12T07:55:01+02:00
+ 2018-01-13T18:04:45+02:00
@@ -144,7 +144,7 @@
 https://alanorth.github.io/cgspace-notes/
- 2018-01-12T07:55:01+02:00
+ 2018-01-13T18:04:45+02:00
 0
@@ -155,7 +155,7 @@
 https://alanorth.github.io/cgspace-notes/tags/notes/
- 2018-01-12T07:55:01+02:00
+ 2018-01-13T18:04:45+02:00
 0
@@ -167,13 +167,13 @@
 https://alanorth.github.io/cgspace-notes/post/
- 2018-01-12T07:55:01+02:00
+ 2018-01-13T18:04:45+02:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
- 2018-01-12T07:55:01+02:00
+ 2018-01-13T18:04:45+02:00
 0
diff --git a/static/2018/01/openrefine-authors.png b/static/2018/01/openrefine-authors.png
new file mode 100644
index 000000000..bc94a8487
Binary files /dev/null and b/static/2018/01/openrefine-authors.png differ
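
Review note on the language corrections in this commit: the `dc.language.iso` updates use PostgreSQL's `~` operator, which is a case-sensitive regular expression match that is not anchored to the start or end of the value. A pattern such as `(IN|In)` would therefore also match any longer value that merely contains `In`. Whether such values existed in the database is not stated in the notes, so the sketch below is only an illustration with made-up input values. It uses Python's `re.search` as a stand-in for the unanchored `~` match and `re.fullmatch` as a stand-in for an anchored pattern like `'^(IN|In)$'`; the function names are hypothetical.

```python
import re

# Patterns copied from the SQL updates in the notes. PostgreSQL's `~` is a
# case-sensitive, unanchored regex match, which re.search approximates here.
PATTERNS = {
    "en": r"(En|English)",
    "fr": r"(fre|frn|French)",
    "es": r"(Spanish|spa)",
    "in": r"(IN|In)",
}


def unanchored_code(value):
    """Return the first code whose pattern occurs anywhere in the value."""
    for code, pattern in PATTERNS.items():
        if re.search(pattern, value):
            return code
    return None


def anchored_code(value):
    """Return the first code whose pattern matches the entire value."""
    for code, pattern in PATTERNS.items():
        if re.fullmatch(pattern, value):
            return code
    return None


if __name__ == "__main__":
    # Hypothetical values for illustration, not taken from CGSpace.
    for value in ["English", "In", "Indonesian"]:
        print(value, unanchored_code(value), anchored_code(value))
```

With these sample values, "Indonesian" is rewritten by the unanchored match but left alone by the anchored one, which is the behavior to check for before running similar updates on production data.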