Update notes for 2017-01-11

2017-01-11 18:19:57 +02:00
parent b15d1f7892
commit a01175f5ce
5 changed files with 177 additions and 1 deletions

@@ -123,3 +123,35 @@ dspace=# delete from collection2item where item_id = '80596' and id not in (9079
## 2017-01-11
- Maria found another item with duplicate mappings: https://cgspace.cgiar.org/handle/10568/78658
- Error in `fix-metadata-values.py` when it tries to print the value for Entwicklung & Ländlicher Raum:
```
Traceback (most recent call last):
File "./fix-metadata-values.py", line 80, in <module>
print("Fixing {} occurences of: {}".format(records_to_fix, record[0]))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 15: ordinal not in range(128)
```
- Seems we need to encode as UTF-8 before printing to screen, i.e.:
```
print("Fixing {} occurences of: {}".format(records_to_fix, record[0].encode('utf-8')))
```
- See: http://stackoverflow.com/a/36427358/487333
- I'm actually not sure if we need to encode() the strings to UTF-8 before writing them to the database... I've never had this issue before
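- A minimal Python 2 sketch (assuming the script uses psycopg2, which handles unicode parameters itself) of why the print fails but the database write shouldn't:
```
# -*- coding: utf-8 -*-
# Hypothetical sketch: in Python 2, printing a unicode string implicitly
# encodes it to ASCII when the output encoding is not set (e.g. when piping),
# which raises UnicodeEncodeError for characters like ä.
value = u'Entwicklung & L\xe4ndlicher Raum'

# Printing: encode explicitly to UTF-8 first
print("Fixing: {}".format(value.encode('utf-8')))

# Writing to the database: pass the unicode object as a bound parameter and
# let psycopg2 encode it to the connection's client encoding, e.g.:
# cursor.execute("update metadatavalue set text_value=%s where ...", (value,))
```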
- Now back to cleaning up some journal titles so we can make the controlled vocabulary:
```
$ ./fix-metadata-values.py -i /tmp/fix-27-journal-titles.csv -f dc.source -t correct -m 55 -d dspace -u dspace -p 'fuuu'
```
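- For reference, my assumption is that the `-f` and `-t` flags name the "old value" and "corrected value" columns of the input CSV; a hypothetical Python 2 sketch of reading it that way:
```
# Hypothetical sketch, assuming the CSV has columns named after the
# -f (dc.source) and -t (correct) flags. Not confirmed against the script.
import csv

with open('/tmp/fix-27-journal-titles.csv', 'rb') as f:
    for row in csv.DictReader(f):
        print("{} -> {}".format(row['dc.source'], row['correct']))
```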
- Now get the top 500 journal titles:
```
dspace=# \copy (select distinct text_value, count(*) from metadatavalue where resource_type_id=2 and metadata_field_id=55 group by text_value order by count desc limit 500) to /tmp/journal-titles.csv with csv;
```
- The values are a bit dirty and outdated, since the file I had given to Abenet and Peter was from November
- I will have to go through these and fix some more before making the controlled vocabulary
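- To speed up that cleanup, a small script could group titles that differ only in case or whitespace (a hypothetical helper, not something that exists in this repository):
```
# -*- coding: utf-8 -*-
# Hypothetical Python 2 sketch: group journal titles from the export that
# differ only in case or whitespace, so near-duplicates are easy to spot
# before building the controlled vocabulary.
import csv
from collections import defaultdict

groups = defaultdict(list)

with open('/tmp/journal-titles.csv', 'rb') as f:
    for title, count in csv.reader(f):
        # Normalize for comparison only: collapse whitespace and lowercase
        key = ' '.join(title.split()).lower()
        groups[key].append((title, int(count)))

# Show groups with more than one variant spelling
for key, variants in groups.items():
    if len(variants) > 1:
        for title, count in variants:
            print('{} ({})'.format(title, count))
        print('---')
```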