Update notes for 2018-02-25

2025-01-27 05:49:12 +01:00 · 2018-02-25 17:25:41 +02:00
parent 5750294974
commit e0dbe07c3b
4 changed files with 112 additions and 14 deletions
--- a/content/post/2018-02.md
+++ b/content/post/2018-02.md
@@ -847,3 +847,51 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
 - Remove Dryland Systems subject from submission form because that CRP closed two years ago ([#355](https://github.com/ilri/DSpace/pull/355))
 - Run all system updates on DSpace Test
 - Email ICT to ask how to proceed with the OCS proforma issue for the new DSpace Test server on Linode
+- Thinking about how to preserve ORCID identifiers attached to existing items in CGSpace
+- We have over 60,000 unique author + authority combinations on CGSpace:
+
+```
+dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
+ count 
+-------
+ 62464
+(1 row)
+```
+
+- I know from earlier this month that there are only 624 unique ORCID identifiers in the Solr authority core, so it's way easier to just fetch the unique ORCID iDs from Solr and then go back to PostgreSQL and do the metadata mapping that way
+- The query in Solr would simply be `orcid_id:*`
+- Assuming I know an authority record's ID, for example `id:d7ef744b-bbd4-4171-b449-00e37e1b776f`, I could query PostgreSQL for all metadata records using that authority:
+
+```
+dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
+ metadata_value_id | resource_id | metadata_field_id |        text_value         | text_lang | place |              authority               | confidence | resource_type_id 
+-------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
+           2726830 |       77710 |                 3 | Rodríguez Chalarca, Jairo |           |     2 | d7ef744b-bbd4-4171-b449-00e37e1b776f |        600 |                2
+(1 row)
+```
+
+- Then I suppose I can use the `resource_id` to identify the item?
+- Actually, `resource_id` is the same id we use in CSV, so I could simply build something like this for a metadata import!
+
+```
+id,cg.creator.id
+93848,Alan S. Orth: 0000-0002-1735-7458||Peter G. Ballantyne: 0000-0001-9346-2893
+```
+
+- I just discovered that [requests-cache](https://requests-cache.readthedocs.io) can transparently cache HTTP requests
+- Running `resolve-orcids.py` with my test input takes 10.5 seconds the first time, and then 3.0 seconds the second time!
+
+```
+$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
+Ali Ramadhan: 0000-0001-5019-1368
+Alan S. Orth: 0000-0002-1735-7458
+Ibrahim Mohammed: 0000-0001-5199-5528
+Nor Azwadi: 0000-0001-9634-1958
+./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.32s user 0.07s system 3% cpu 10.530 total
+$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
+Ali Ramadhan: 0000-0001-5019-1368
+Alan S. Orth: 0000-0002-1735-7458
+Ibrahim Mohammed: 0000-0001-5199-5528
+Nor Azwadi: 0000-0001-9634-1958
+./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.23s user 0.05s system 8% cpu 3.046 total
+```
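The CSV idea in the notes above could be sketched roughly as follows. This is a minimal illustration and not the code used on CGSpace: it assumes the Solr and PostgreSQL lookups have already been joined into `(resource_id, "Name: ORCID")` pairs, and the function name `build_orcid_csv` and the sample `rows` are hypothetical.

```python
import csv
import io
from collections import defaultdict


def build_orcid_csv(rows):
    """Group (resource_id, "Name: ORCID") pairs into one CSV row per item."""
    creators = defaultdict(list)
    for resource_id, creator in rows:
        # Keep first-seen order and skip duplicates for the same item
        if creator not in creators[resource_id]:
            creators[resource_id].append(creator)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "cg.creator.id"])
    for resource_id in sorted(creators):
        # Multiple values in one cell are joined with "||", as in the notes
        writer.writerow([resource_id, "||".join(creators[resource_id])])
    return out.getvalue()


# Hypothetical input, standing in for the joined Solr + PostgreSQL results
rows = [
    (93848, "Alan S. Orth: 0000-0002-1735-7458"),
    (93848, "Peter G. Ballantyne: 0000-0001-9346-2893"),
]
csv_text = build_orcid_csv(rows)
print(csv_text, end="")
```

Using the `csv` module rather than string concatenation means a value that itself contains a comma would be quoted automatically.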
--- a/docs/2018-02/index.html
+++ b/docs/2018-02/index.html
@@ -23,7 +23,7 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-pl

 <meta property="article:published_time" content="2018-02-01T16:28:54&#43;02:00"/>

-<meta property="article:modified_time" content="2018-02-25T11:23:54&#43;02:00"/>
+<meta property="article:modified_time" content="2018-02-25T13:28:26&#43;02:00"/>



@@ -57,9 +57,9 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-pl
  "@type": "BlogPosting",
  "headline": "February, 2018",
  "url": "https://alanorth.github.io/cgspace-notes/2018-02/",
-  "wordCount": "5219",
+  "wordCount": "5516",
  "datePublished": "2018-02-01T16:28:54&#43;02:00",
-  "dateModified": "2018-02-25T11:23:54&#43;02:00",
+  "dateModified": "2018-02-25T13:28:26&#43;02:00",
  "author": {
    "@type": "Person",
    "name": "Alan Orth"
@@ -1074,8 +1074,58 @@ Given Names Deactivated Family Name Deactivated: 0000-0002-2614-426X
 <li>Remove Dryland Systems subject from submission form because that CRP closed two years ago (<a href="https://github.com/ilri/DSpace/pull/355">#355</a>)</li>
 <li>Run all system updates on DSpace Test</li>
 <li>Email ICT to ask how to proceed with the OCS proforma issue for the new DSpace Test server on Linode</li>
+<li>Thinking about how to preserve ORCID identifiers attached to existing items in CGSpace</li>
+<li>We have over 60,000 unique author + authority combinations on CGSpace:</li>
 </ul>

+<pre><code>dspace=# select count(distinct (text_value, authority)) from metadatavalue where resource_type_id=2 and metadata_field_id=3;
+ count 
+-------
+ 62464
+(1 row)
+</code></pre>
+
+<ul>
+<li>I know from earlier this month that there are only 624 unique ORCID identifiers in the Solr authority core, so it&rsquo;s way easier to just fetch the unique ORCID iDs from Solr and then go back to PostgreSQL and do the metadata mapping that way</li>
+<li>The query in Solr would simply be <code>orcid_id:*</code></li>
+<li>Assuming I know an authority record&rsquo;s ID, for example <code>id:d7ef744b-bbd4-4171-b449-00e37e1b776f</code>, I could query PostgreSQL for all metadata records using that authority:</li>
+</ul>
+
+<pre><code>dspace=# select * from metadatavalue where resource_type_id=2 and authority='d7ef744b-bbd4-4171-b449-00e37e1b776f';
+ metadata_value_id | resource_id | metadata_field_id |        text_value         | text_lang | place |              authority               | confidence | resource_type_id 
+-------------------+-------------+-------------------+---------------------------+-----------+-------+--------------------------------------+------------+------------------
+           2726830 |       77710 |                 3 | Rodríguez Chalarca, Jairo |           |     2 | d7ef744b-bbd4-4171-b449-00e37e1b776f |        600 |                2
+(1 row)
+</code></pre>
+
+<ul>
+<li>Then I suppose I can use the <code>resource_id</code> to identify the item?</li>
+<li>Actually, <code>resource_id</code> is the same id we use in CSV, so I could simply build something like this for a metadata import!</li>
+</ul>
+
+<pre><code>id,cg.creator.id
+93848,Alan S. Orth: 0000-0002-1735-7458||Peter G. Ballantyne: 0000-0001-9346-2893
+</code></pre>
+
+<ul>
+<li>I just discovered that <a href="https://requests-cache.readthedocs.io">requests-cache</a> can transparently cache HTTP requests</li>
+<li>Running <code>resolve-orcids.py</code> with my test input takes 10.5 seconds the first time, and then 3.0 seconds the second time!</li>
+</ul>
+
+<pre><code>$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
+Ali Ramadhan: 0000-0001-5019-1368
+Alan S. Orth: 0000-0002-1735-7458
+Ibrahim Mohammed: 0000-0001-5199-5528
+Nor Azwadi: 0000-0001-9634-1958
+./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.32s user 0.07s system 3% cpu 10.530 total
+$ time ./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names
+Ali Ramadhan: 0000-0001-5019-1368
+Alan S. Orth: 0000-0002-1735-7458
+Ibrahim Mohammed: 0000-0001-5199-5528
+Nor Azwadi: 0000-0001-9634-1958
+./resolve-orcids.py -i orcid-test-values.txt -o /tmp/orcid-names  0.23s user 0.05s system 8% cpu 3.046 total
+</code></pre>
+
  

  
--- a/docs/robots.txt
+++ b/docs/robots.txt
@@ -32,7 +32,7 @@ Disallow: /cgspace-notes/2015-12/
 Disallow: /cgspace-notes/2015-11/
 Disallow: /cgspace-notes/
 Disallow: /cgspace-notes/categories/
-Disallow: /cgspace-notes/tags/notes/
 Disallow: /cgspace-notes/categories/notes/
+Disallow: /cgspace-notes/tags/notes/
 Disallow: /cgspace-notes/post/
 Disallow: /cgspace-notes/tags/
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/2018-02/</loc>
-    <lastmod>2018-02-25T11:23:54+02:00</lastmod>
+    <lastmod>2018-02-25T13:28:26+02:00</lastmod>
  </url>
  
  <url>
@@ -149,7 +149,7 @@
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/</loc>
-    <lastmod>2018-02-25T11:23:54+02:00</lastmod>
+    <lastmod>2018-02-25T13:28:26+02:00</lastmod>
    <priority>0</priority>
  </url>
  
@@ -158,27 +158,27 @@
    <priority>0</priority>
  </url>
  
-  <url>
-    <loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
-    <lastmod>2018-02-25T11:23:54+02:00</lastmod>
-    <priority>0</priority>
-  </url>
-  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
    <lastmod>2017-09-28T12:00:49+03:00</lastmod>
    <priority>0</priority>
  </url>
  
+  <url>
+    <loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
+    <lastmod>2018-02-25T13:28:26+02:00</lastmod>
+    <priority>0</priority>
+  </url>
+  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/post/</loc>
-    <lastmod>2018-02-25T11:23:54+02:00</lastmod>
+    <lastmod>2018-02-25T13:28:26+02:00</lastmod>
    <priority>0</priority>
  </url>
  
  <url>
    <loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
-    <lastmod>2018-02-25T11:23:54+02:00</lastmod>
+    <lastmod>2018-02-25T13:28:26+02:00</lastmod>
    <priority>0</priority>
  </url>