May, 2018

Tue May 01, 2018 by Alan Orth in Notes

2018-05-01

I cleared the Solr statistics core on DSpace Test by issuing two commands directly to the Solr admin interface:
- http://localhost:3000/solr/statistics/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E
- http://localhost:3000/solr/statistics/update?stream.body=%3Ccommit/%3E
Then I reduced the JVM heap size from 6144m back to 5120m
Also, I switched it to use OpenJDK instead of Oracle Java, and reworked the Ansible infrastructure scripts so that hosts can choose which distribution they want to use
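
The two Solr URLs above are just XML update messages, URL-encoded into the stream.body parameter. A minimal Python sketch of how they could be built, assuming the same host and port as above (the helper name solr_update_url is made up for illustration):

```python
from urllib.parse import quote

# Base URL taken from the commands above
SOLR_UPDATE = "http://localhost:3000/solr/statistics/update"

def solr_update_url(xml_body, base=SOLR_UPDATE):
    """Return an update URL with the XML message in stream.body (hypothetical helper)."""
    # Keep "/", "*" and ":" unescaped so the result matches the hand-written URLs
    return base + "?stream.body=" + quote(xml_body, safe="/*:")

delete_all = solr_update_url("<delete><query>*:*</query></delete>")
commit = solr_update_url("<commit/>")
print(delete_all)
print(commit)
```

This only builds the strings; actually requesting them would delete everything in the core, so it is left out on purpose.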

2018-05-02

Advise Fabio Fidanza about integrating CGSpace content in the new CGIAR corporate website
I think they can mostly rely on using the cg.contributor.crp field
Looking over some IITA records for Sisay
- Other than trimming and collapsing consecutive whitespace, I made some other corrections
- I need to check the correct formatting of COTE D'IVOIRE (straight apostrophe) vs COTE D’IVOIRE (curly apostrophe)
- I updated all DOIs to use HTTPS
- I checked a few DOIs and found at least one that was missing, so I Googled the title of the paper and found the correct DOI
- Also, I found an FAQ for DOI that says the dx.doi.org syntax is older, so I will replace all the DOIs with doi.org instead
- I found five records with “ISI Jounal” instead of “ISI Journal”
- I found one item with IITA subject “.”
- Need to remember to check the facets for things like this in sponsorship:
  - Deutsche Gesellschaft für Internationale Zusammenarbeit
  - Deutsche Gesellschaft fur Internationale Zusammenarbeit
- Eight records with language “fn” instead of “fr”
- One incorrect type (lowercase “proceedings”): Conference proceedings
- Found some capitalized CRPs in cg.contributor.crp
- Found some incorrect author affiliations, e.g. “Institut de Recherche pour le Developpement Agricolc” should be “Institut de Recherche pour le Developpement Agricole”
- Wow, and for sponsors there are the following:
  - Incorrect: Flemish Agency for Development Cooperation and Technical Assistance
  - Incorrect: Flemish Organization for Development Cooperation and Technical Assistance
  - Correct: Flemish Association for Development Cooperation and Technical Assistance
- One item had region “WEST” (I corrected it to “WEST AFRICA”)
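
The whitespace and DOI clean-ups from the list above could be sketched in Python roughly like this. It is only an illustration of the idea, not what I ran in OpenRefine; the function names and the loose DOI pattern are my own assumptions:

```python
import re

def clean_whitespace(value):
    """Trim the ends and collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", value).strip()

# Optional scheme and optional "dx." in front of doi.org
_RESOLVER = re.compile(r"^(?:https?://)?(?:dx\.)?doi\.org/", re.IGNORECASE)
# Loose shape of a DOI name (an assumption): "10.", a registrant code, a slash, a suffix
_DOI_NAME = re.compile(r"^10\.\d{4,9}/\S+$")

def normalize_doi(value):
    """Return the DOI as an https://doi.org/ URL, or None if it needs manual review."""
    name = _RESOLVER.sub("", clean_whitespace(value))
    if _DOI_NAME.match(name):
        return "https://doi.org/" + name
    return None

print(normalize_doi("http://dx.doi.org/10.1016/j.cropro.2008.07.003"))
```

Anything that does not look like a DOI comes back as None, so it can be faceted on and fixed by hand rather than silently rewritten.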

2018-05-03

It turns out that the IITA records that I was helping Sisay with in March were imported in 2018-04 without a final check by Abenet or me
There are lots of errors on language, CRP, and even some encoding errors on abstract fields
I exported them, including the hidden metadata fields like dc.date.accessioned, so I can filter the ones from 2018-04 and correct them in OpenRefine:

$ dspace metadata-export -a -f /tmp/iita.csv -i 10568/68616
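
The filtering on dc.date.accessioned can also be done outside OpenRefine. A small Python sketch, assuming the export has a column whose name starts with dc.date.accessioned (possibly with a language qualifier appended) and that multiple values are joined with "||"; the sample rows are made up:

```python
import csv
import io

def rows_accessioned_in(csv_text, month="2018-04"):
    """Yield rows whose accession date starts with the given YYYY-MM prefix."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: match any column whose name starts with dc.date.accessioned
    date_cols = [c for c in reader.fieldnames if c.startswith("dc.date.accessioned")]
    for row in reader:
        # Multiple values in one cell are separated by "||"
        dates = [d for c in date_cols for d in (row[c] or "").split("||")]
        if any(d.startswith(month) for d in dates):
            yield row

# Made-up sample data, only to show the shape of the input
sample = (
    "id,dc.date.accessioned,dc.title\n"
    "1,2018-04-12T08:15:00Z,April item\n"
    "2,2018-03-23T10:00:00Z,March item\n"
)
april = [row["id"] for row in rows_accessioned_in(sample)]
print(april)
```

For the real export the text would come from reading /tmp/iita.csv as UTF-8 instead of the inline sample.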

Abenet sent a list of 46 ORCID identifiers for ILRI authors so I need to get their names using my resolve-orcids.py script and merge them into our controlled vocabulary
On the messed up IITA records from 2018-04 I see sixty DOIs in incorrect format (cg.identifier.doi)

2018-05-06

Fixing the IITA records from Sisay, sixty DOIs have completely invalid format like http:dx.doi.org10.1016j.cropro.2008.07.003
I corrected all the DOIs and then checked them for validity with a quick bash loop:

$ for line in $(< /tmp/links.txt); do echo $line; http --print h $line; done

Most of the links are good, though one is a duplicate and one even seems to be incorrect on the publisher’s site so…
Also, there are some duplicates:
- 10568/92241 and 10568/92230 (same DOI)
- 10568/92151 and 10568/92150 (same ISBN)
- 10568/92291 and 10568/92286 (same citation, title, authors, year)
Messed up abstracts:
- 10568/92309
Fixed some issues in regions, countries, sponsors, ISSN, and cleaned whitespace errors from citation, abstract, author, and titles
Fixed all issues with CRPs
A few more interesting Unicode characters to look for in text fields like author, abstracts, and citations might be: ’ (0x2019), · (0x00b7), and € (0x20ac)
A custom text facet in OpenRefine with this GREL expression could be a good way to find invalid characters or encoding errors in authors, abstracts, etc:

or(
  isNotNull(value.match(/.*[(|)].*/)),
  isNotNull(value.match(/.*\uFFFD.*/)),
  isNotNull(value.match(/.*\u00A0.*/)),
  isNotNull(value.match(/.*\u200A.*/)),
  isNotNull(value.match(/.*\u2019.*/)),
  isNotNull(value.match(/.*\u00b7.*/)),
  isNotNull(value.match(/.*\u20ac.*/))
)
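
For checking a CSV outside OpenRefine, roughly the same test can be written as one Python character class. This is a sketch that mirrors the characters in the GREL expression above; the function names are made up:

```python
import re

# Same characters as the GREL facet: parentheses and pipe, U+FFFD replacement
# character, no-break space, hair space, right single quote, middle dot, euro sign
SUSPICIOUS = re.compile("[(|)\uFFFD\u00A0\u200A\u2019\u00B7\u20AC]")

def is_suspicious(value):
    """True if the value contains any of the characters worth a second look."""
    return bool(SUSPICIOUS.search(value))

def suspicious_chars(value):
    """Return the sorted, de-duplicated suspicious characters found in value."""
    return sorted(set(SUSPICIOUS.findall(value)))

print(suspicious_chars("C\u00f4te d\u2019Ivoire (draft)"))
```

A match only means the value deserves a look: a curly apostrophe or parentheses are often legitimate, while U+FFFD almost never is.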

I found some more IITA records that Sisay imported on 2018-03-23 that have invalid CRP names, so now I kinda want to check those ones!
Combine the ORCID identifiers Abenet sent with our existing list and resolve their names using the resolve-orcids.py script:

$ cat ~/src/git/DSpace/dspace/config/controlled-vocabularies/cg-creator-id.xml /tmp/ilri-orcids.txt | grep -oE '[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}' | sort | uniq > /tmp/2018-05-06-combined.txt
$ ./resolve-orcids.py -i /tmp/2018-05-06-combined.txt -o /tmp/2018-05-06-combined-names.txt -d
# sort names, copy to cg-creator-id.xml, add XML formatting, and then format with tidy (preserving accents)
$ tidy -xml -utf8 -iq -m -w 0 dspace/config/controlled-vocabularies/cg-creator-id.xml
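
The grep, sort, and uniq step above amounts to "find every ORCID-shaped string in both inputs and keep the unique ones". A Python sketch of the same idea, using the same pattern as the grep; the function name and the identifiers in the example are placeholders, not real authors:

```python
import re

# Same pattern as the grep -oE expression above
ORCID = re.compile(r"[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}")

def unique_orcids(*texts):
    """Return sorted, de-duplicated ORCID identifiers found in the given texts."""
    found = set()
    for text in texts:
        found.update(ORCID.findall(text))
    return sorted(found)

# Placeholder inputs standing in for cg-creator-id.xml and the new list
vocabulary = '<node id="Jane Doe: 0000-0002-1825-0097" label="Jane Doe: 0000-0002-1825-0097"/>'
new_list = "0000-0002-1825-0097\n0000-0001-2345-678X\n"
print(unique_orcids(vocabulary, new_list))
```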

I made a pull request (#373) for this that I’ll merge some time next week (I’m expecting Atmire to get back to us about DSpace 5.8 soon)
After testing quickly I just decided to merge it, and I noticed that I don’t even need to restart Tomcat for the changes to get loaded

2018-05-07

I spent a bit of time playing with conciliator and Solr, trying to figure out how to reconcile columns in OpenRefine with data in our existing Solr cores (like CRP subjects)
The documentation regarding the Solr stuff is limited, and I cannot figure out what all the fields in conciliator.properties are supposed to be
But then I found reconcile-csv, which allows you to reconcile against values in a CSV file!
That, combined with splitting our multi-value fields on “||” in OpenRefine is amaaaaazing, because after reconciliation you can just join them again
Oh wow, you can also facet on the individual values once you’ve split them! That’s going to be amazing for proofing CRPs, subjects, etc.
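
The split, fix, and join idea is simple enough to sketch in Python as well. This is only an illustration of what OpenRefine does with multi-value cells, with made-up function names and a made-up corrections table:

```python
def split_values(cell, sep="||"):
    """Split a multi-value cell and drop empty or whitespace-only parts."""
    return [v.strip() for v in cell.split(sep) if v.strip()]

def fix_values(cell, corrections, sep="||"):
    """Split a multi-value cell, apply per-value corrections, and join it again."""
    return sep.join(corrections.get(v, v) for v in split_values(cell, sep))

# Hypothetical corrections, e.g. the result of a reconciliation pass
corrections = {"LIVESTOCK AND FISH": "Livestock and Fish"}
print(fix_values("LIVESTOCK AND FISH||Maize", corrections))
```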

2018-05-09

Udana asked about the Book Chapters we had been proofing on DSpace Test in 2018-04
I told him that there were still some TODO items for him on that data, for example to update the dc.language.iso field for the Spanish items
I was trying to remember how I parsed the input-forms.xml using xmllint to extract subjects neatly
I could use it with reconcile-csv or to populate a Solr instance for reconciliation
This XPath expression gets close, but outputs all items on one line:

$ xmllint --xpath '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/node()' dspace/config/input-forms.xml        
Agriculture for Nutrition and HealthBig DataClimate Change, Agriculture and Food SecurityExcellence in BreedingFishForests, Trees and AgroforestryGenebanksGrain Legumes and Dryland CerealsLivestockMaizePolicies, Institutions and MarketsRiceRoots, Tubers and BananasWater, Land and EcosystemsWheatAquatic Agricultural SystemsDryland CerealsDryland SystemsGrain LegumesIntegrated Systems for the Humid TropicsLivestock and Fish

Maybe xmlstarlet is better:

$ xmlstarlet sel -t -v '//value-pairs[@value-pairs-name="crpsubject"]/pair/stored-value/text()' dspace/config/input-forms.xml
Agriculture for Nutrition and Health
Big Data
Climate Change, Agriculture and Food Security
Excellence in Breeding
Fish
Forests, Trees and Agroforestry
Genebanks
Grain Legumes and Dryland Cereals
Livestock
Maize
Policies, Institutions and Markets
Rice
Roots, Tubers and Bananas
Water, Land and Ecosystems
Wheat
Aquatic Agricultural Systems
Dryland Cereals
Dryland Systems
Grain Legumes
Integrated Systems for the Humid Tropics
Livestock and Fish
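
The same extraction can be sketched with Python's standard library, which returns the values as a list that is easy to write out as a CSV for reconcile-csv. The inline XML is a trimmed-down assumption of what input-forms.xml looks like, and the function name is made up:

```python
import xml.etree.ElementTree as ET

def stored_values(xml_text, list_name="crpsubject"):
    """Return the stored-value texts of one value-pairs list, in document order."""
    root = ET.fromstring(xml_text)
    path = ".//value-pairs[@value-pairs-name='%s']/pair/stored-value" % list_name
    return [element.text for element in root.findall(path)]

# Simplified sample, not the real file
sample = """<input-forms>
  <form-value-pairs>
    <value-pairs value-pairs-name="crpsubject" dc-term="contributor.crp">
      <pair><displayed-value>Fish</displayed-value><stored-value>Fish</stored-value></pair>
      <pair><displayed-value>Maize</displayed-value><stored-value>Maize</stored-value></pair>
    </value-pairs>
  </form-value-pairs>
</input-forms>"""

for subject in stored_values(sample):
    print(subject)
```

For the real file the text would be read from dspace/config/input-forms.xml instead of the sample.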

Discuss Colombian BNARS harvesting the CIAT data from CGSpace
They are using a system called Primo and the only options for data harvesting in that system are via FTP and OAI
I told them to get all CIAT records via OAI
Just a note to myself, I figured out how to get reconcile-csv to run from source rather than running the old pre-compiled JAR file:

$ lein run /tmp/crps.csv id

I tried to reconcile against a CSV of our countries but reconcile-csv crashes

2018-05-13

It turns out there was a space in my “country” header that was causing reconcile-csv to crash
After removing that it works fine!
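
To avoid tripping over this again, the header row could be stripped before handing the CSV to anything else. A small Python sketch; the function name is made up and the trailing space in the sample is just one way the header could have been wrong:

```python
import csv
import io

def read_with_clean_headers(csv_text):
    """Read CSV text into dicts, stripping stray whitespace from the header names."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = [h.strip() for h in next(reader)]
    return [dict(zip(headers, row)) for row in reader]

# Note the stray space after "country" in the header
rows = read_with_clean_headers("id,country \n1,ALBANIA\n")
print(rows)
```
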
Looking at Sisay’s 2,640 CIFOR records on DSpace Test (10568/92904)
- Trimmed all leading / trailing white space and condensed multiple spaces into one
- Corrected DOIs to use HTTPS and “doi.org” instead of “dx.doi.org”
- There are eight items in cg.identifier.doi that are not DOIs
- Corrected cg.identifier.url links to cifor.org to use HTTPS
- Corrected dc.language.iso from vt to vi (Vietnamese)
- Corrected affiliations to not use acronyms
- Reconcile countries against our countries list (removing terms like LATIN AMERICA, CENTRAL AFRICA, etc that are not countries)
- Reconcile regions against our list of regions

2018-05-14

Send a message to the OpenRefine mailing list about the bug with reconciling multi-value cells
Help Silvia Alonso get a list of all her publications since 2013 from Listings and Reports

2018-05-15

Turns out I was doing the OpenRefine reconciliation wrong: I needed to copy the matched values to a new column!
Also, I learned how to do something cool with Jython expressions in OpenRefine
This will fetch a URL and return its HTTP response code:

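import urllib2
import re

# Only check values containing 10.1016 (dots escaped so they match literally)
pattern = re.compile(r'.*10\.1016.*')
if value and pattern.match(value):
  try:
    get = urllib2.urlopen(value)
    return get.getcode()
  except urllib2.HTTPError, e:
    # urlopen raises on 4xx/5xx responses, so take the code from the exception
    return e.code

return "blank"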

I used a regex to limit it to just some of the DOIs in this case because there were thousands of URLs
Here the response code would be 200, 404, etc, or “blank” if the value does not match the regex
You could use this in a facet or in a new column
More information and good examples here: https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine
Finish looking at the 2,640 CIFOR records on DSpace Test (10568/92904), cleaning up authors and adding collection mappings
They can now be moved to CGSpace as far as I’m concerned, but I don’t know whether Sisay or I will do it
I was checking the CIFOR data for duplicates using Atmire’s Metadata Quality Module (and found some duplicates actually), but then DSpace died…
I didn’t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in dmesg -T:

[Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
[Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
[Tue May 15 12:10:01 2018] oom_reaper: reaped process 3763 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

So the Linux kernel killed Java…
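
For a sense of scale, the anon-rss figure in the "Killed process" line is in kB, and converting it shows the Java process had roughly 5.4 GiB resident when it was killed. A quick Python sketch of that arithmetic (the function name is made up):

```python
import re

def anon_rss_gib(kernel_line):
    """Return the anon-rss value of a 'Killed process' line in GiB, or None."""
    match = re.search(r"anon-rss:(\d+)kB", kernel_line)
    if match is None:
        return None
    return int(match.group(1)) / (1024 * 1024)  # kB -> GiB

line = ("[Tue May 15 12:10:01 2018] Killed process 3763 (java) "
        "total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB")
print(round(anon_rss_gib(line), 2))
```
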
Maria from Bioversity mailed to say she got an error while submitting an item on CGSpace:

Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission

Looking in the DSpace log I see something related:

2018-05-15 12:35:30,858 INFO  org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060

So I’m not sure…
I finally figured out how to get OpenRefine to reconcile values from Solr via conciliator:
The trick was to use a more appropriate Solr fieldType, text_en instead of text_general, so that more terms match, for example uppercase and lowercase:

$ ./bin/solr start
$ ./bin/solr create_core -c countries
$ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
$ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema

It still doesn’t catch simple mistakes like “ALBANI” or “AL BANIA” for “ALBANIA”, and it doesn’t return scores, so I have to select matches manually:

[Image: OpenRefine reconciling countries from local Solr]
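
Just to illustrate what scored fuzzy matching would look like for mistakes such as “ALBANI” or “AL BANIA”, here is a Python sketch using difflib from the standard library. This is a different technique from what Solr or conciliator do, and the function name, cutoff, and short country list are assumptions for the example:

```python
import difflib

def best_matches(value, vocabulary, cutoff=0.8, limit=3):
    """Return (candidate, score) pairs at or above the cutoff, best first."""
    scored = [
        (candidate, difflib.SequenceMatcher(None, value, candidate).ratio())
        for candidate in vocabulary
    ]
    scored = [pair for pair in scored if pair[1] >= cutoff]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

# Tiny stand-in for our countries list
countries = ["ALBANIA", "ALGERIA", "ARMENIA"]
for typo in ("ALBANI", "AL BANIA"):
    print(typo, best_matches(typo, countries))
```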

I should probably make a general copy field and set it to be the default search field, like DSpace’s search core does (see schema.xml):

<defaultSearchField>search_text</defaultSearchField>
...
<copyField source="*" dest="search_text"/>

Actually, I wonder how much of their schema I could just copy…
Apparently the default search field is the df parameter and you could technically just add it to the query string, so no need to bother with that in the schema now
I copied over the DSpace search_text field type from the DSpace Solr config (had to remove some properties so Solr would start) but it doesn’t seem to be any better at matching than the text_en type
I think I need to focus on trying to return scores with conciliator
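
As a note on the df parameter: a query against the local countries core with the default field set in the query string, and with Solr asked to include the score pseudo-field in the results, could be built like this. It is a Python sketch that only constructs the URL; the function name is made up, and the host, core, and field come from the commands above:

```python
from urllib.parse import urlencode

def select_url(term, field="country", core="countries",
               base="http://localhost:8983/solr"):
    """Build a select URL that sets the default field and requests scores."""
    params = {
        "q": term,
        "df": field,        # default search field, instead of defaultSearchField in the schema
        "fl": "*,score",    # all stored fields plus the relevance score
        "wt": "json",
    }
    return "%s/%s/select?%s" % (base, core, urlencode(params))

print(select_url("ALBANIA"))
```

Whether conciliator can pass these scores through to OpenRefine is a separate question that this does not answer.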

2018-05-16

Discuss GDPR with James Stapleton
- As far as I see it, we are “Data Controllers” on CGSpace because we store people’s names, emails, and phone numbers if they register
- We set cookies on the user’s computer, but these do not contain personally identifiable information (PII) and they are “session” cookies which are deleted when the user closes their browser
- We use Google Analytics to track website usage, which makes Google the “Data Processor” and in this case we merely need to limit or obfuscate the information we send to them
- As the only personally identifiable information we send is the user’s IP address, I think we only need to enable IP Address Anonymization in our analytics.js code snippets
- Then we can add a “Privacy” page to CGSpace that makes all of this clear