Add notes for 2018-02-11

This commit is contained in:
Alan Orth 2018-02-11 18:21:39 +02:00
parent d312304729
commit 3441bd7128
Signed by: alanorth
GPG Key ID: 0FB860CC9C45B1B9
4 changed files with 185 additions and 14 deletions

View File

@ -302,6 +302,25 @@ $ psql -c 'select * from pg_stat_activity' | grep -o -E '(dspaceWeb|dspaceApi|ds
- I cherry-picked all the commits for DS-3551 but it won't build on our current DSpace 5.5!
- I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle
## 2018-02-10
- I tried to disable ORCID lookups but keep the existing authorities
- This item has an ORCID for Ralf Kiese: http://localhost:8080/handle/10568/89897
- Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn't show up on the item
- Leave all settings but change choices.presentation to lookup and ORCID badge is there and item submission uses LC Name Authority and it breaks with this error:
```
Field dc_contributor_author has choice presentation of type "select", it may NOT be authority-controlled.
```
- If I change choices.presentation to suggest it give this error:
```
xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
```
- So I don't think we can disable the ORCID lookup function and keep the ORCID badges
## 2018-02-11
- Magdalena from CCAFS emailed to ask why one of their items has such a weird thumbnail: [10568/90735](https://cgspace.cgiar.org/handle/10568/90735)
@ -315,3 +334,64 @@ $ convert CCAFS_WP_223.pdf\[0\] -profile /usr/local/share/ghostscript/9.22/iccpr
```
![Manual thumbnail](/cgspace-notes/2018/02/CCAFS_WP_223.jpg)
- Peter sent me corrected author names last week but the file encoding is messed up:
```
$ isutf8 authors-2018-02-05.csv
authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.
```
- The `isutf8` program comes from `moreutils`
- Line 100 contains: Galiè, Alessandra
- In other news, psycopg2 is splitting their package in pip, so to install the binary wheel distribution you need to use `pip install psycopg2-binary`
- See: http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/
- I updated my `fix-metadata-values.py` and `delete-metadata-values.py` scripts on the scripts page: https://github.com/ilri/DSpace/wiki/Scripts
- I ran the 342 author corrections (after trimming whitespace and excluding those with `||` and other syntax errors) on CGSpace:
```
$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
```
- Then I ran a full Discovery re-indexing:
```
$ export JAVA_OPTS="-Dfile.encoding=UTF-8 -Xmx1024m"
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
```
- That reminds me that Bizu had asked me to fix some of Alan Duncan's names in December
- I see he actually has some variations with "Duncan, Alan J.": https://cgspace.cgiar.org/discover?filtertype_1=author&filter_relational_operator_1=contains&filter_1=Duncan%2C+Alan&submit_apply_filter=&query=
- I will just update those for her too and then restart the indexing:
```
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
text_value | authority | confidence
-----------------+--------------------------------------+------------
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 | 600
Duncan, Alan J. | 62298c84-4d9d-4b83-a932-4a9dd4046db7 | -1
Duncan, Alan J. | | -1
Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
Duncan, Alan J. | cd0e03bf-92c3-475f-9589-60c5b042ea60 | -1
Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d | -1
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 | -1
Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
(8 rows)
dspace=# begin;
dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
UPDATE 216
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
text_value | authority | confidence
--------------+--------------------------------------+------------
Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
(1 row)
dspace=# commit;
```
- Run all system updates on DSpace Test (linode02) and reboot it
- I wrote a Python script ([`resolve-orcids-from-solr.py`](https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b)) using SolrClient to parse the Solr authority cache for ORCID IDs
- We currently have 1562 authority records with ORCID IDs, and 624 unique IDs
- We can use this to build a controlled vocabulary of ORCID IDs for new item submissions
- I don't know how to add ORCID IDs to existing items yet... some more querying of PostgreSQL for authority values perhaps?
- I added the script to the [ILRI DSpace wiki on GitHub](https://github.com/ilri/DSpace/wiki/Scripts)

View File

@ -23,7 +23,7 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu’s munin-pl
<meta property="article:published_time" content="2018-02-01T16:28:54&#43;02:00"/>
<meta property="article:modified_time" content="2018-02-08T01:08:36&#43;02:00"/>
<meta property="article:modified_time" content="2018-02-11T10:01:13&#43;02:00"/>
@ -57,9 +57,9 @@ I copied the logic in the jmx_tomcat_dbpools provided by Ubuntu&rsquo;s munin-pl
"@type": "BlogPosting",
"headline": "February, 2018",
"url": "https://alanorth.github.io/cgspace-notes/2018-02/",
"wordCount": "2147",
"wordCount": "2666",
"datePublished": "2018-02-01T16:28:54&#43;02:00",
"dateModified": "2018-02-08T01:08:36&#43;02:00",
"dateModified": "2018-02-11T10:01:13&#43;02:00",
"author": {
"@type": "Person",
"name": "Alan Orth"
@ -455,6 +455,30 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
<li>I sent a message to the dspace-tech mailing list asking why DSpace thinks these connections are busy when PostgreSQL says they are idle</li>
</ul>
<h2 id="2018-02-10">2018-02-10</h2>
<ul>
<li>I tried to disable ORCID lookups but keep the existing authorities</li>
<li>This item has an ORCID for Ralf Kiese: <a href="http://localhost:8080/handle/10568/89897">http://localhost:8080/handle/10568/89897</a></li>
<li>Switch authority.controlled off and change authorLookup to lookup, and the ORCID badge doesn&rsquo;t show up on the item</li>
<li>Leave all settings but change choices.presentation to lookup and ORCID badge is there and item submission uses LC Name Authority and it breaks with this error:
<br /></li>
</ul>
<pre><code>Field dc_contributor_author has choice presentation of type &quot;select&quot;, it may NOT be authority-controlled.
</code></pre>
<ul>
<li>If I change choices.presentation to suggest it give this error:</li>
</ul>
<pre><code>xmlui.mirage2.forms.instancedCompositeFields.noSuggestionError
</code></pre>
<ul>
<li>So I don&rsquo;t think we can disable the ORCID lookup function and keep the ORCID badges</li>
</ul>
<h2 id="2018-02-11">2018-02-11</h2>
<ul>
@ -472,6 +496,73 @@ $ grep 46.229.168 dspace.log.2018-02-07 | grep -o -E 'session_id=[A-Z0-9]{32}' |
<p><img src="/cgspace-notes/2018/02/CCAFS_WP_223.jpg" alt="Manual thumbnail" /></p>
<ul>
<li>Peter sent me corrected author names last week but the file encoding is messed up:</li>
</ul>
<pre><code>$ isutf8 authors-2018-02-05.csv
authors-2018-02-05.csv: line 100, char 18, byte 4179: After a first byte between E1 and EC, expecting the 2nd byte between 80 and BF.
</code></pre>
<ul>
<li>The <code>isutf8</code> program comes from <code>moreutils</code></li>
<li>Line 100 contains: Galiè, Alessandra</li>
<li>In other news, psycopg2 is splitting their package in pip, so to install the binary wheel distribution you need to use <code>pip install psycopg2-binary</code></li>
<li>See: <a href="http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/">http://initd.org/psycopg/articles/2018/02/08/psycopg-274-released/</a></li>
<li>I updated my <code>fix-metadata-values.py</code> and <code>delete-metadata-values.py</code> scripts on the scripts page: <a href="https://github.com/ilri/DSpace/wiki/Scripts">https://github.com/ilri/DSpace/wiki/Scripts</a></li>
<li>I ran the 342 author corrections (after trimming whitespace and excluding those with <code>||</code> and other syntax errors) on CGSpace:</li>
</ul>
<pre><code>$ ./fix-metadata-values.py -i Correct-342-Authors-2018-02-11.csv -f dc.contributor.author -t correct -m 3 -d dspace -u dspace -p 'fuuu'
</code></pre>
<ul>
<li>Then I ran a full Discovery re-indexing:</li>
</ul>
<pre><code>$ export JAVA_OPTS=&quot;-Dfile.encoding=UTF-8 -Xmx1024m&quot;
$ time schedtool -D -e ionice -c2 -n7 nice -n19 dspace index-discovery -b
</code></pre>
<ul>
<li>That reminds me that Bizu had asked me to fix some of Alan Duncan&rsquo;s names in December</li>
<li>I see he actually has some variations with &ldquo;Duncan, Alan J.&rdquo;: <a href="https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=">https://cgspace.cgiar.org/discover?filtertype_1=author&amp;filter_relational_operator_1=contains&amp;filter_1=Duncan%2C+Alan&amp;submit_apply_filter=&amp;query=</a></li>
<li>I will just update those for her too and then restart the indexing:</li>
</ul>
<pre><code>dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
text_value | authority | confidence
-----------------+--------------------------------------+------------
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 | 600
Duncan, Alan J. | 62298c84-4d9d-4b83-a932-4a9dd4046db7 | -1
Duncan, Alan J. | | -1
Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
Duncan, Alan J. | cd0e03bf-92c3-475f-9589-60c5b042ea60 | -1
Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d | -1
Duncan, Alan J. | 5ff35043-942e-4d0a-b377-4daed6e3c1a3 | -1
Duncan, Alan J. | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
(8 rows)
dspace=# begin;
dspace=# update metadatavalue set text_value='Duncan, Alan', authority='a6486522-b08a-4f7a-84f9-3a73ce56034d', confidence=600 where resource_type_id=2 and metadata_field_id=3 and text_value like 'Duncan, Alan%';
UPDATE 216
dspace=# select distinct text_value, authority, confidence from metadatavalue where resource_type_id=2 and metadata_field_id=3 and text_value like '%Duncan, Alan%';
text_value | authority | confidence
--------------+--------------------------------------+------------
Duncan, Alan | a6486522-b08a-4f7a-84f9-3a73ce56034d | 600
(1 row)
dspace=# commit;
</code></pre>
<ul>
<li>Run all system updates on DSpace Test (linode02) and reboot it</li>
<li>I wrote a Python script (<a href="https://gist.github.com/alanorth/57a88379126d844563c1410bd7b8d12b"><code>resolve-orcids-from-solr.py</code></a>) using SolrClient to parse the Solr authority cache for ORCID IDs</li>
<li>We currently have 1562 authority records with ORCID IDs, and 624 unique IDs</li>
<li>We can use this to build a controlled vocabulary of ORCID IDs for new item submissions</li>
<li>I don&rsquo;t know how to add ORCID IDs to existing items yet&hellip; some more querying of PostgreSQL for authority values perhaps?</li>
<li>I added the script to the <a href="https://github.com/ilri/DSpace/wiki/Scripts">ILRI DSpace wiki on GitHub</a></li>
</ul>

View File

@ -32,7 +32,7 @@ Disallow: /cgspace-notes/2015-12/
Disallow: /cgspace-notes/2015-11/
Disallow: /cgspace-notes/
Disallow: /cgspace-notes/categories/
Disallow: /cgspace-notes/categories/notes/
Disallow: /cgspace-notes/tags/notes/
Disallow: /cgspace-notes/categories/notes/
Disallow: /cgspace-notes/post/
Disallow: /cgspace-notes/tags/

View File

@ -4,7 +4,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/2018-02/</loc>
<lastmod>2018-02-08T01:08:36+02:00</lastmod>
<lastmod>2018-02-11T10:01:13+02:00</lastmod>
</url>
<url>
@ -149,7 +149,7 @@
<url>
<loc>https://alanorth.github.io/cgspace-notes/</loc>
<lastmod>2018-02-08T01:08:36+02:00</lastmod>
<lastmod>2018-02-11T10:01:13+02:00</lastmod>
<priority>0</priority>
</url>
@ -158,27 +158,27 @@
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-02-11T10:01:13+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/categories/notes/</loc>
<lastmod>2017-09-28T12:00:49+03:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/notes/</loc>
<lastmod>2018-02-08T01:08:36+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/post/</loc>
<lastmod>2018-02-08T01:08:36+02:00</lastmod>
<lastmod>2018-02-11T10:01:13+02:00</lastmod>
<priority>0</priority>
</url>
<url>
<loc>https://alanorth.github.io/cgspace-notes/tags/</loc>
<lastmod>2018-02-08T01:08:36+02:00</lastmod>
<lastmod>2018-02-11T10:01:13+02:00</lastmod>
<priority>0</priority>
</url>