diff --git a/content/posts/2018-05.md b/content/posts/2018-05.md
index b7ecdac25..d80585a08 100644
--- a/content/posts/2018-05.md
+++ b/content/posts/2018-05.md
@@ -175,6 +175,7 @@ $ lein run /tmp/crps.csv id
 ## 2018-05-14
 
 - Send a message to the OpenRefine mailing list about the bug with reconciling multi-value cells
+- Help Silvia Alonso get a list of all her publications since 2013 from Listings and Reports
 
 ## 2018-05-15
 
@@ -200,3 +201,52 @@ return "blank"
 - More information and good examples here: https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine
 - Finish looking at the 2,640 CIFOR records on DSpace Test ([10568/92904](https://dspacetest.cgiar.org/handle/10568/92904)), cleaning up authors and adding collection mappings
 - They can now be moved to CGSpace as far as I'm concerned, but I don't know if Sisay will do it or me
+- I was checking the CIFOR data for duplicates using Atmire's Metadata Quality Module (and found some duplicates actually), but then DSpace died...
+- I didn't see anything in the Tomcat, DSpace, or Solr logs, but I saw this in `dmesg -T`:
+
+```
+[Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
+[Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
+[Tue May 15 12:10:01 2018] oom_reaper: reaped process 3763 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
+```
+
+- So the Linux kernel killed Java...
+- Maria from Bioversity mailed to say she got an error while submitting an item on CGSpace:
+
+```
+Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
+```
+
+- Looking in the DSpace log I see something related:
+
+```
+2018-05-15 12:35:30,858 INFO org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
+```
+
+- So I'm not sure...
+- I finally figured out how to get OpenRefine to reconcile values from Solr via [conciliator](https://github.com/codeforkjeff/conciliator):
+- The trick was to use a more appropriate Solr fieldType `text_en` instead of `text_general` so that more terms match, for example uppercase and lowercase:
+
+```
+$ ./bin/solr start
+$ ./bin/solr create_core -c countries
+$ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
+$ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema
+```
+
+- It still doesn't catch simple mistakes like "ALBANI" or "AL BANIA" for "ALBANIA", and it doesn't return scores, so I have to select matches manually:
+
+![OpenRefine reconciling countries from local Solr](/cgspace-notes/2018/05/openrefine-solr-conciliator.png)
+
+- I should probably make a general copy field and set it to be the default search field, like DSpace's search core does (see schema.xml):
+
+```
+<defaultSearchField>search_text</defaultSearchField>
+...
+<copyField source="*" dest="search_text"/>
+```
+
+- Actually, I wonder how much of their schema I could just copy...
+- Apparently the default search field is the `df` parameter and you could technically just add it to the query string, so no need to bother with that in the schema now
+- I copied over the DSpace `search_text` field type from the DSpace Solr config (had to remove some properties so Solr would start) but it doesn't seem to be any better at matching than the `text_en` type
+- I think I need to focus on trying to return scores with conciliator
diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html
index 815604af6..42ba57122 100644
--- a/docs/2018-05/index.html
+++ b/docs/2018-05/index.html
@@ -27,7 +27,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
-
+
@@ -65,9 +65,9 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
 "@type": "BlogPosting",
 "headline": "May, 2018",
 "url": "https://alanorth.github.io/cgspace-notes/2018-05/",
- "wordCount": "1441",
+ "wordCount": "1811",
 "datePublished": "2018-05-01T16:43:54+03:00",
- "dateModified": "2018-05-13T18:30:25+03:00",
+ "dateModified": "2018-05-15T13:25:03+03:00",
 "author": {
 "@type": "Person",
 "name": "Alan Orth"
@@ -340,6 +340,7 @@ Livestock and Fish

2018-05-15

@@ -368,6 +369,62 @@ return "blank"
  • More information and good examples here: https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine
  • Finish looking at the 2,640 CIFOR records on DSpace Test (10568/92904), cleaning up authors and adding collection mappings
  • They can now be moved to CGSpace as far as I’m concerned, but I don’t know if Sisay will do it or me
  • I was checking the CIFOR data for duplicates using Atmire’s Metadata Quality Module (and found some duplicates actually), but then DSpace died…
  • I didn’t see anything in the Tomcat, DSpace, or Solr logs, but I saw this in dmesg -T:
    [Tue May 15 12:10:01 2018] Out of memory: Kill process 3763 (java) score 706 or sacrifice child
    [Tue May 15 12:10:01 2018] Killed process 3763 (java) total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB
    [Tue May 15 12:10:01 2018] oom_reaper: reaped process 3763 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  • So the Linux kernel killed Java…
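The kernel log lines above carry the PID, process name, and memory figures of the process that was killed. Here is a minimal sketch, assuming Python 3 and the exact `Killed process …` wording shown in this log (other kernel versions may format it differently), of extracting those numbers so the resident size can be compared with the JVM heap settings. The names `OOM_KILL` and `parse_oom_kill` are hypothetical and only for illustration:

```python
import re

# Matches the "Killed process" line as it appears in this log.
# Assumption: total-vm comes first and anon-rss second, as shown above.
OOM_KILL = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, name and memory sizes (in kB) from an OOM kill line, or None."""
    match = OOM_KILL.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "total_vm_kb": int(match.group("total_vm_kb")),
        "anon_rss_kb": int(match.group("anon_rss_kb")),
    }


if __name__ == "__main__":
    sample = (
        "[Tue May 15 12:10:01 2018] Killed process 3763 (java) "
        "total-vm:14667688kB, anon-rss:5705268kB, file-rss:0kB, shmem-rss:0kB"
    )
    info = parse_oom_kill(sample)
    resident_gib = info["anon_rss_kb"] / 1024 / 1024
    print(f"{info['name']} (pid {info['pid']}): {resident_gib:.1f} GiB resident")
```

For the line in this log, the resident size works out to roughly 5.4 GiB, which is the number to weigh against the server's RAM and whatever heap limit Tomcat was given.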
  • Maria from Bioversity mailed to say she got an error while submitting an item on CGSpace:
    Unable to load Submission Information, since WorkspaceID (ID:S96060) is not a valid in-process submission
  • Looking in the DSpace log I see something related:
    2018-05-15 12:35:30,858 INFO  org.dspace.submit.step.CompleteStep @ m.garruccio@cgiar.org:session_id=8AC4499945F38B45EF7A1226E3042DAE:submission_complete:Completed submission with id=96060
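  • So I’m not sure…
  • I finally figured out how to get OpenRefine to reconcile values from Solr via conciliator:
  • The trick was to use a more appropriate Solr fieldType text_en instead of text_general so that more terms match, for example uppercase and lowercase: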
    $ ./bin/solr start
    $ ./bin/solr create_core -c countries
    $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
    $ curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"country", "type":"text_en", "multiValued":false, "stored":true}}' http://localhost:8983/solr/countries/schema
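  • It still doesn’t catch simple mistakes like “ALBANI” or “AL BANIA” for “ALBANIA”, and it doesn’t return scores, so I have to select matches manually: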

    OpenRefine reconciling countries from local Solr
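Misspellings such as "ALBANI" and "AL BANIA" differ from "ALBANIA" by only a character or so, which makes them edit-distance problems rather than tokenization problems. As a rough sketch of that idea outside Solr, the following uses Python's standard `difflib` module; this is a swapped-in technique for illustration, not what conciliator or Solr does. The `best_match` helper, the candidate list, and the 0.8 cutoff are all assumptions:

```python
from difflib import get_close_matches


def best_match(value, candidates, cutoff=0.8):
    """Return the closest candidate to value, or None if nothing is similar enough."""
    # Normalize case and collapse repeated whitespace before comparing.
    normalized = " ".join(value.upper().split())
    matches = get_close_matches(normalized, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # Hypothetical controlled vocabulary; the real one would come from the CSV.
    countries = ["ALBANIA", "ALGERIA", "ARMENIA"]
    for raw in ["ALBANI", "AL BANIA", "albania ", "ZZZ"]:
        print(repr(raw), "->", best_match(raw, countries))
```

The cutoff trades recall for safety: a lower value catches more typos but starts pairing unrelated names, so any automatic match would still deserve a manual check.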

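  • I should probably make a general copy field and set it to be the default search field, like DSpace’s search core does (see schema.xml):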
    <defaultSearchField>search_text</defaultSearchField>
    ...
    <copyField source="*" dest="search_text"/>
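  • Actually, I wonder how much of their schema I could just copy…
  • Apparently the default search field is the df parameter and you could technically just add it to the query string, so no need to bother with that in the schema now
  • I copied over the DSpace search_text field type from the DSpace Solr config (had to remove some properties so Solr would start) but it doesn’t seem to be any better at matching than the text_en type
  • I think I need to focus on trying to return scores with conciliator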
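The notes mention that the default search field can be passed as the `df` query parameter instead of being set in the schema. A minimal sketch of building such a query against the local `countries` core follows. It also asks Solr for the relevance score through `fl=*,score`, which may help when investigating why no scores reach OpenRefine. The `build_select_url` helper and its defaults are hypothetical, and the host and port are assumed from the `curl` command in these notes:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(term, core="countries", field="country",
                     base="http://localhost:8983/solr", rows=5):
    """Build a Solr /select URL that sets the default field per request."""
    params = {
        "q": term,          # bare term, searched in the default field
        "df": field,        # default search field, instead of a schema setting
        "fl": "*,score",    # all stored fields plus the relevance score
        "rows": rows,
        "wt": "json",
    }
    return f"{base}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    url = build_select_url("ALBANIA")
    print(url)
    # Decode the query string again to check what Solr will receive.
    print(parse_qs(urlsplit(url).query))
```

Opening the printed URL in a browser or passing it to `curl` should show whether Solr itself ranks the candidates, which would narrow the missing scores down to conciliator's side.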
diff --git a/docs/2018/05/openrefine-solr-conciliator.png b/docs/2018/05/openrefine-solr-conciliator.png
new file mode 100644
index 000000000..369b4278a
Binary files /dev/null and b/docs/2018/05/openrefine-solr-conciliator.png differ
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index dec806107..9e8615e44 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -4,7 +4,7 @@
 https://alanorth.github.io/cgspace-notes/2018-05/
- 2018-05-13T18:30:25+03:00
+ 2018-05-15T13:25:03+03:00
@@ -164,7 +164,7 @@
 https://alanorth.github.io/cgspace-notes/
- 2018-05-13T18:30:25+03:00
+ 2018-05-15T13:25:03+03:00
 0
@@ -175,7 +175,7 @@
 https://alanorth.github.io/cgspace-notes/tags/notes/
- 2018-05-13T18:30:25+03:00
+ 2018-05-15T13:25:03+03:00
 0
@@ -187,13 +187,13 @@
 https://alanorth.github.io/cgspace-notes/posts/
- 2018-05-13T18:30:25+03:00
+ 2018-05-15T13:25:03+03:00
 0
 https://alanorth.github.io/cgspace-notes/tags/
- 2018-05-13T18:30:25+03:00
+ 2018-05-15T13:25:03+03:00
 0
diff --git a/static/2018/05/openrefine-solr-conciliator.png b/static/2018/05/openrefine-solr-conciliator.png
new file mode 100644
index 000000000..369b4278a
Binary files /dev/null and b/static/2018/05/openrefine-solr-conciliator.png differ