From e0dbe07c3b2970bac9cdec1a7375d39fd260d730 Mon Sep 17 00:00:00 2001
From: Alan Orth
Date: Sun, 25 Feb 2018 17:25:41 +0200
Subject: [PATCH] Update notes for 2018-02-25

---
 content/post/2018-02.md | 48 +++++++++++++++++++++++++++++++++++
 docs/2018-02/index.html | 56 ++++++++++++++++++++++++++++++++++++++---
 docs/robots.txt         |  2 +-
 docs/sitemap.xml        | 20 +++++++--------
 4 files changed, 112 insertions(+), 14 deletions(-)

diff --git a/content/post/2018-02.md b/content/post/2018-02.md
index 6821822a6..7184a3027 100644
--- a/content/post/2018-02.md
+++ b/content/post/2018-02.md
@@ -847,3 +847,51 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
 - Remove Dryland Systems subject from submission form because that CRP closed two years ago ([#355](https://github.com/ilri/DSpace/pull/355))
 - Run all system updates on DSpace Test
 - Email ICT to ask how to proceed with the OCS proforma issue for the new DSpace Test server on Linode
+- Thinking about how to preserve ORCID identifiers attached to existing items in CGSpace
+- We have over 60,000 unique author + authority combinations on CGSpace:
+
+```
+dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
+ count
+-------
+ 62464
+(1 row)
+```
+
+- I know from earlier this month that there are only 624 unique ORCID identifiers in the Solr authority core, so it's way easier to just fetch the unique ORCID iDs from Solr and then go back to PostgreSQL and do the metadata mapping that way
+- The query in Solr would simply be `orcid_id:*`
+- Assuming I know that authority record with `id:d7ef744b-bbd4-4171-b449-00e37e1b776f`, then I could query PostgreSQL for all metadata records using that authority:
+
+```
+dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
+ metadata_value_id | resource_id | metadata_field_id |        text_value         | text_lang | place |              authority               | confidence | resource_type_id
+-------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
+           2726830 |       77710 |                 3 | Rodríguez Chalarca, Jairo |           |     2 | d7ef744b-bbd4-4171-b449-00e37e1b776f |        600 |                2
+(1 row)
+```
+
+- Then I suppose I can use the `resource_id` to identify the item?
+- Actually, `resource_id` is the same id we use in CSV, so I could simply build something like this for a metadata import!
+
+```
+id,cg.creator.id
+93848,Alan S. Orth: 0000-0002-1735-7458||Peter G. Ballantyne: 0000-0001-9346-2893
+```
+
+- I just discovered that [requests-cache](https://requests-cache.readthedocs.io) can transparently cache HTTP requests
+- Running `resolve-orcids.py` with my test input takes 10.5 seconds the first time, and then 3.0 seconds the second time!
+
+```
+$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
+Ali Ramadhan: 0000-0001-5019-1368
+Alan S. Orth: 0000-0002-1735-7458
+Ibrahim Mohammed: 0000-0001-5199-5528
+Nor Azwadi: 0000-0001-9634-1958
+./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.32s user 0.07s system 3% cpu 10.530 total
+$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
+Ali Ramadhan: 0000-0001-5019-1368
+Alan S. Orth: 0000-0002-1735-7458
+Ibrahim Mohammed: 0000-0001-5199-5528
+Nor Azwadi: 0000-0001-9634-1958
+./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.23s user 0.05s system 8% cpu 3.046 total
+```
diff --git a/docs/2018-02/index.html b/docs/2018-02/index.html
index ab8872c00..5b8a222d8 100644
--- a/docs/2018-02/index.html
+++ b/docs/2018-02/index.html
@@ -23,7 +23,7 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-pl
-
+
@@ -57,9 +57,9 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-pl
   "@type": "BlogPosting",
   "headline": "February, 2018",
   "url": "https://alanorth.github.io/cgspace-notes/2018-02/",
-  "wordCount": "5219",
+  "wordCount": "5516",
   "datePublished": "2018-02-01T16:28:54+02:00",
-  "dateModified": "2018-02-25T11:23:54+02:00",
+  "dateModified": "2018-02-25T13:28:26+02:00",
   "author": {
   "@type": "Person",
   "name": "Alan Orth"
@@ -1074,8 +1074,58 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
  • Remove Dryland Systems subject from submission form because that CRP closed two years ago (#355)
  • Run all system updates on DSpace Test
  • Email ICT to ask how to proceed with the OCS proforma issue for the new DSpace Test server on Linode
+ • Thinking about how to preserve ORCID identifiers attached to existing items in CGSpace
+ • We have over 60,000 unique author + authority combinations on CGSpace:
    +dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
    + count 
    +-------
    + 62464
    +(1 row)
    +
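+ • I know from earlier this month that there are only 624 unique ORCID identifiers in the Solr authority core, so it’s way easier to just fetch the unique ORCID iDs from Solr and then go back to PostgreSQL and do the metadata mapping that way
+ • The query in Solr would simply be orcid_id:*
+ • Assuming I know that authority record with id:d7ef744b-bbd4-4171-b449-00e37e1b776f, then I could query PostgreSQL for all metadata records using that authority:
    +dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';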
    + metadata_value_id | resource_id | metadata_field_id |        text_value         | text_lang | place |              authority               | confidence | resource_type_id 
    +-------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
    +           2726830 |       77710 |                 3 | Rodríguez Chalarca, Jairo |           |     2 | d7ef744b-bbd4-4171-b449-00e37e1b776f |        600 |                2
    +(1 row)
    +
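+ • Then I suppose I can use the resource_id to identify the item?
+ • Actually, resource_id is the same id we use in CSV, so I could simply build something like this for a metadata import!
    +id,cg.creator.id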
    +93848,Alan S. Orth: 0000-0002-1735-7458||Peter G. Ballantyne: 0000-0001-9346-2893
    +
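+ • I just discovered that requests-cache can transparently cache HTTP requests
+ • Running resolve-orcids.py with my test input takes 10.5 seconds the first time, and then 3.0 seconds the second time!
    +$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names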
    +Ali Ramadhan: 0000-0001-5019-1368
    +Alan S. Orth: 0000-0002-1735-7458
    +Ibrahim Mohammed: 0000-0001-5199-5528
    +Nor Azwadi: 0000-0001-9634-1958
    +./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.32s user 0.07s system 3% cpu 10.530 total
    +$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
    +Ali Ramadhan: 0000-0001-5019-1368
    +Alan S. Orth: 0000-0002-1735-7458
    +Ibrahim Mohammed: 0000-0001-5199-5528
    +Nor Azwadi: 0000-0001-9634-1958
    +./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.23s user 0.05s system 8% cpu 3.046 total
    +
    +
diff --git a/docs/robots.txt b/docs/robots.txt
index 12f3bc2f9..86c3f15b8 100644
--- a/docs/robots.txt
+++ b/docs/robots.txt
@@ -32,7 +32,7 @@ Disallow: /cgspace-notes/2015-12/
 Disallow: /cgspace-notes/2015-11/
 Disallow: /cgspace-notes/
 Disallow: /cgspace-notes/categories/
-Disallow: /cgspace-notes/tags/notes/
 Disallow: /cgspace-notes/categories/notes/
+Disallow: /cgspace-notes/tags/notes/
 Disallow: /cgspace-notes/post/
 Disallow: /cgspace-notes/tags/
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 3b56565dd..de5e9a61e 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
   https://alanorth.github.io/cgspace-notes/2018-02/
-  2018-02-25T11:23:54+02:00
+  2018-02-25T13:28:26+02:00
@@ -149,7 +149,7 @@
   https://alanorth.github.io/cgspace-notes/
-  2018-02-25T11:23:54+02:00
+  2018-02-25T13:28:26+02:00
   0
@@ -158,27 +158,27 @@
   0
-
-  https://alanorth.github.io/cgspace-notes/tags/notes/
-  2018-02-25T11:23:54+02:00
-  0
-
-
   https://alanorth.github.io/cgspace-notes/categories/notes/
   2017-09-28T12:00:49+03:00
   0
+
+  https://alanorth.github.io/cgspace-notes/tags/notes/
+  2018-02-25T13:28:26+02:00
+  0
+
+
   https://alanorth.github.io/cgspace-notes/post/
-  2018-02-25T11:23:54+02:00
+  2018-02-25T13:28:26+02:00
   0
   https://alanorth.github.io/cgspace-notes/tags/
-  2018-02-25T11:23:54+02:00
+  2018-02-25T13:28:26+02:00
   0
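
The CSV idea in the 2018-02-25 notes (reusing `resource_id` as the `id` column of a metadata import) could be sketched roughly as below. This is only an illustration under assumptions, not the author's script: the function name `build_orcid_csv` and the `(item_id, name, orcid)` row layout are hypothetical, and the rows are assumed to have already been collected from PostgreSQL and Solr. The sample values are the ones from the notes.

```python
import csv
import io
from collections import defaultdict


def build_orcid_csv(rows):
    """Render (item_id, name, orcid) rows as a metadata-import CSV.

    Hypothetical helper: values for the same item are joined with "||",
    matching the multi-value format shown in the notes.
    """
    creators = defaultdict(list)
    for item_id, name, orcid in rows:
        value = f"{name}: {orcid}"
        # Skip exact duplicates so an author is listed once per item.
        if value not in creators[item_id]:
            creators[item_id].append(value)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "cg.creator.id"])
    for item_id in sorted(creators):
        writer.writerow([item_id, "||".join(creators[item_id])])
    return out.getvalue()


# Sample rows taken from the example in the notes.
rows = [
    (93848, "Alan S. Orth", "0000-0002-1735-7458"),
    (93848, "Peter G. Ballantyne", "0000-0001-9346-2893"),
]
print(build_orcid_csv(rows), end="")
```

Grouping by item id first keeps one CSV row per item, which is the shape the notes show for the import file.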