From a01175f5ce90e1f1d9d07d4af2620b9c855208b9 Mon Sep 17 00:00:00 2001 From: Alan Orth Date: Wed, 11 Jan 2017 18:19:57 +0200 Subject: [PATCH] Update notes for 2017-01-11 --- content/post/2017-01.md | 32 +++++++++++++++++++++++++++++++ public/2017-01/index.html | 38 ++++++++++++++++++++++++++++++++++++- public/index.xml | 36 +++++++++++++++++++++++++++++++++++ public/post/index.xml | 36 +++++++++++++++++++++++++++++++++++ public/tags/notes/index.xml | 36 +++++++++++++++++++++++++++++++++++ 5 files changed, 177 insertions(+), 1 deletion(-) diff --git a/content/post/2017-01.md b/content/post/2017-01.md index db887f1bf..969682b33 100644 --- a/content/post/2017-01.md +++ b/content/post/2017-01.md @@ -123,3 +123,35 @@ dspace=# delete from collection2item where item_id = '80596' and id not in (9079 ## 2017-01-11 - Maria found another item with duplicate mappings: https://cgspace.cgiar.org/handle/10568/78658 +- Error in `fix-metadata-values.py` when it tries to print the value for Entwicklung & Ländlicher Raum: + +``` +Traceback (most recent call last): + File "./fix-metadata-values.py", line 80, in + print("Fixing {} occurences of: {}".format(records_to_fix, record[0])) +UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128) +``` + +- Seems we need to encode as UTF-8 before printing to screen, ie: + +``` +print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8'))) +``` + +- See: http://stackoverflow.com/a/36427358/487333 +- I'm actually not sure if we need to encode() the strings to UTF-8 before writing them to the database... 
I've never had this issue before +- Now back to cleaning up some journal titles so we can make the controlled vocabulary: + +``` +$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace-u dspace-p 'fuuu' +``` + +- Now get the top 500 journal titles: + +``` +dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by cou +nt desc limit 500) to /tmp/journal-titles.csv with csv; +``` + +- The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November +- I will have to go through these and fix some more before making the controlled vocabulary diff --git a/public/2017-01/index.html b/public/2017-01/index.html index 6d1d0e78a..0f8c94efc 100644 --- a/public/2017-01/index.html +++ b/public/2017-01/index.html @@ -28,7 +28,7 @@ - + @@ -244,6 +244,42 @@ Caused by: java.net.SocketException: Broken pipe (Write failed) + +
Traceback (most recent call last):
+  File "./fix-metadata-values.py", line 80, in <module>
+    print("Fixing {} occurences of: {}".format(records_to_fix, record[0]))
+UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
+
+ + + +
print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
+
+ + + +
$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
+
+ + + +
dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by cou
+nt desc limit 500) to /tmp/journal-titles.csv with csv;
+
+ + diff --git a/public/index.xml b/public/index.xml index a4f638224..f310b1a26 100644 --- a/public/index.xml +++ b/public/index.xml @@ -151,6 +151,42 @@ Caused by: java.net.SocketException: Broken pipe (Write failed) <ul> <li>Maria found another item with duplicate mappings: <a href="https://cgspace.cgiar.org/handle/10568/78658">https://cgspace.cgiar.org/handle/10568/78658</a></li> +<li>Error in <code>fix-metadata-values.py</code> when it tries to print the value for Entwicklung &amp; Ländlicher Raum:</li> +</ul> + +<pre><code>Traceback (most recent call last): + File &quot;./fix-metadata-values.py&quot;, line 80, in &lt;module&gt; + print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0])) +UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128) +</code></pre> + +<ul> +<li>Seems we need to encode as UTF-8 before printing to screen, ie:</li> +</ul> + +<pre><code>print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0].encode('utf-8'))) +</code></pre> + +<ul> +<li>See: <a href="http://stackoverflow.com/a/36427358/487333">http://stackoverflow.com/a/36427358/487333</a></li> +<li>I&rsquo;m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database&hellip; I&rsquo;ve never had this issue before</li> +<li>Now back to cleaning up some journal titles so we can make the controlled vocabulary:</li> +</ul> + +<pre><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace-u dspace-p 'fuuu' +</code></pre> + +<ul> +<li>Now get the top 500 journal titles:</li> +</ul> + +<pre><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by cou +nt desc limit 500) to /tmp/journal-titles.csv with csv; +</code></pre> + +<ul> +<li>The values are a bit dirty and outdated, since the file I had given to Abenet and 
Peter was from November</li> +<li>I will have to go through these and fix some more before making the controlled vocabulary</li> </ul> diff --git a/public/post/index.xml b/public/post/index.xml index 51ee0835d..a8d6f7442 100644 --- a/public/post/index.xml +++ b/public/post/index.xml @@ -151,6 +151,42 @@ Caused by: java.net.SocketException: Broken pipe (Write failed) <ul> <li>Maria found another item with duplicate mappings: <a href="https://cgspace.cgiar.org/handle/10568/78658">https://cgspace.cgiar.org/handle/10568/78658</a></li> +<li>Error in <code>fix-metadata-values.py</code> when it tries to print the value for Entwicklung &amp; Ländlicher Raum:</li> +</ul> + +<pre><code>Traceback (most recent call last): + File &quot;./fix-metadata-values.py&quot;, line 80, in &lt;module&gt; + print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0])) +UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128) +</code></pre> + +<ul> +<li>Seems we need to encode as UTF-8 before printing to screen, ie:</li> +</ul> + +<pre><code>print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0].encode('utf-8'))) +</code></pre> + +<ul> +<li>See: <a href="http://stackoverflow.com/a/36427358/487333">http://stackoverflow.com/a/36427358/487333</a></li> +<li>I&rsquo;m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database&hellip; I&rsquo;ve never had this issue before</li> +<li>Now back to cleaning up some journal titles so we can make the controlled vocabulary:</li> +</ul> + +<pre><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace-u dspace-p 'fuuu' +</code></pre> + +<ul> +<li>Now get the top 500 journal titles:</li> +</ul> + +<pre><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by cou +nt desc limit 
500) to /tmp/journal-titles.csv with csv; +</code></pre> + +<ul> +<li>The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November</li> +<li>I will have to go through these and fix some more before making the controlled vocabulary</li> </ul> diff --git a/public/tags/notes/index.xml b/public/tags/notes/index.xml index 26bdf4adc..6867691f5 100644 --- a/public/tags/notes/index.xml +++ b/public/tags/notes/index.xml @@ -150,6 +150,42 @@ Caused by: java.net.SocketException: Broken pipe (Write failed) <ul> <li>Maria found another item with duplicate mappings: <a href="https://cgspace.cgiar.org/handle/10568/78658">https://cgspace.cgiar.org/handle/10568/78658</a></li> +<li>Error in <code>fix-metadata-values.py</code> when it tries to print the value for Entwicklung &amp; Ländlicher Raum:</li> +</ul> + +<pre><code>Traceback (most recent call last): + File &quot;./fix-metadata-values.py&quot;, line 80, in &lt;module&gt; + print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0])) +UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128) +</code></pre> + +<ul> +<li>Seems we need to encode as UTF-8 before printing to screen, ie:</li> +</ul> + +<pre><code>print(&quot;Fixing {} occurences of: {}&quot;.format(records_to_fix, record[0].encode('utf-8'))) +</code></pre> + +<ul> +<li>See: <a href="http://stackoverflow.com/a/36427358/487333">http://stackoverflow.com/a/36427358/487333</a></li> +<li>I&rsquo;m actually not sure if we need to encode() the strings to UTF-8 before writing them to the database&hellip; I&rsquo;ve never had this issue before</li> +<li>Now back to cleaning up some journal titles so we can make the controlled vocabulary:</li> +</ul> + +<pre><code>$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace-u dspace-p 'fuuu' +</code></pre> + +<ul> +<li>Now get the top 500 journal titles:</li> +</ul> + 
+<pre><code>dspace-# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by cou +nt desc limit 500) to /tmp/journal-titles.csv with csv; +</code></pre> + +<ul> +<li>The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November</li> +<li>I will have to go through these and fix some more before making the controlled vocabulary</li> </ul>
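
The `UnicodeEncodeError` in the notes above comes from pushing a non-ASCII character (the `ä` in "Ländlicher", at index 15 of the title) through the ASCII codec. The traceback is from Python 2 (note the `u'\xe4'`), where `.encode('utf-8')` on the value works around it. As a rough sketch only, assuming Python 3 and using hypothetical helper names that are not part of `fix-metadata-values.py`, the same idea of keeping text as text and encoding to UTF-8 only at the output boundary could look like this:

```python
import sys


def format_fix_message(count, value):
    """Build the progress message as text; no encoding happens here.

    Hypothetical helper, not taken from fix-metadata-values.py.
    """
    return "Fixing {} occurrences of: {}".format(count, value)


def write_utf8(message, stream=None):
    """Encode the message to UTF-8 and write the raw bytes to a binary stream.

    Writing bytes sidesteps whatever codec the terminal locale would pick
    (for example ASCII under LANG=C).
    """
    if stream is None:
        # Binary layer underneath sys.stdout; looked up at call time.
        stream = sys.stdout.buffer
    stream.write(message.encode("utf-8") + b"\n")
```

Calling `write_utf8(format_fix_message(3, "Entwicklung & Ländlicher Raum"))` should then print the title regardless of the locale's default encoding. This covers only printing to the screen; it does not settle the open question in the notes about whether strings need encoding before being written to the database.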