diff --git a/content/posts/2018-05.md b/content/posts/2018-05.md
index f90c1b61d..b7ecdac25 100644
--- a/content/posts/2018-05.md
+++ b/content/posts/2018-05.md
@@ -162,7 +162,7 @@ $ lein run /tmp/crps.csv id
 - It turns out there was a space in my "country" header that was causing reconcile-csv to crash
 - After removing that it works fine!
-- Looking at Sisay's 2,000 CIFOR records on DSpace Test ([10568/92904](https://dspacetest.cgiar.org/handle/10568/92904))
+- Looking at Sisay's 2,640 CIFOR records on DSpace Test ([10568/92904](https://dspacetest.cgiar.org/handle/10568/92904))
 - Trimmed all leading / trailing white space and condensed multiple spaces into one
 - Corrected DOIs to use HTTPS and "doi.org" instead of "dx.doi.org"
 - There are eight items in `cg.identifier.doi` that are not DOIs)
@@ -171,3 +171,32 @@ $ lein run /tmp/crps.csv id
 - Corrected affiliations to not use acronyms
 - Reconcile countries against our countries list (removing terms like LATIN AMERICA, CENTRAL AFRICA, etc that are not countries)
 - Reconcile regions against our list of regions
+
+## 2018-05-14
+
+- Send a message to the OpenRefine mailing list about the bug with reconciling multi-value cells
+
+## 2018-05-15
+
+- Turns out I was doing the OpenRefine reconciliation wrong: I needed to copy the matched values to a new column!
+- Also, I learned how to do something cool with Jython expressions in OpenRefine
+- This will fetch a URL and return its HTTP response code:
+
+```
+import urllib2
+import re
+
+pattern = re.compile('.*10.1016.*')
+if pattern.match(value):
+    get = urllib2.urlopen(value)
+    return get.getcode()
+
+return "blank"
+```
+
+- I used a regex to limit it to just some of the DOIs in this case because there were thousands of URLs
+- Here the response code would be 200, 404, etc, or "blank" if there is no URL for that item
+- You could use this in a facet or in a new column
+- More information and good examples here: https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine
+- Finish looking at the 2,640 CIFOR records on DSpace Test ([10568/92904](https://dspacetest.cgiar.org/handle/10568/92904)), cleaning up authors and adding collection mappings
+- They can now be moved to CGSpace as far as I'm concerned, but I don't know if Sisay will do it or me
diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html
index 1985e55ab..815604af6 100644
--- a/docs/2018-05/index.html
+++ b/docs/2018-05/index.html
@@ -27,7 +27,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
-
+
@@ -65,9 +65,9 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
 "@type": "BlogPosting",
 "headline": "May, 2018",
 "url": "https://alanorth.github.io/cgspace-notes/2018-05/",
- "wordCount": "1263",
+ "wordCount": "1441",
 "datePublished": "2018-05-01T16:43:54+03:00",
- "dateModified": "2018-05-10T14:41:37+03:00",
+ "dateModified": "2018-05-13T18:30:25+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -322,7 +322,7 @@ Livestock and Fish
+import urllib2
+import re
+
+pattern = re.compile('.*10.1016.*')
+if pattern.match(value):
+    get = urllib2.urlopen(value)
+    return get.getcode()
+
+return "blank"
+
+
+