diff --git a/content/posts/2018-05.md b/content/posts/2018-05.md index b7cc94c8e..d884686a1 100644 --- a/content/posts/2018-05.md +++ b/content/posts/2018-05.md @@ -278,4 +278,11 @@ ga('send', 'pageview', { - Testing reconciliation of countries against Solr via conciliator, I notice that `CÔTE D'IVOIRE` doesn't match `COTE D'IVOIRE`, whereas with reconcile-csv it does - Also, when reconciling regions against Solr via conciliator `EASTERN AFRICA` doesn't match `EAST AFRICA`, whereas with reconcile-csv it does - And `SOUTH AMERICA` matches both `SOUTH ASIA` and `SOUTH AMERICA` with the same match score of 2... WTF. -- It could be that I just need to tune the index or query filters in Solr (currently using the example `text_en` field type) +- It could be that I just need to tune the query filter in Solr (currently using the example `text_en` field type) +- Oh sweet, it turns out that the issue with searching for characters with accents is called "ASCII folding" in Solr +- You can use either a [`solr.ASCIIFoldingFilterFactory` filter](https://lucene.apache.org/solr/guide/7_3/language-analysis.html) or a [`solr.MappingCharFilterFactory` charFilter](https://lucene.apache.org/solr/guide/7_3/charfilterfactories.html) mapping against `mapping-FoldToASCII.txt` +- Also see: https://opensourceconnections.com/blog/2017/02/20/solr-utf8/ +- Now `CÔTE D'IVOIRE` matches `COTE D'IVOIRE`! +- I'm not sure which method is better, perhaps the `solr.ASCIIFoldingFilterFactory` filter because it doesn't require copying the `mapping-FoldToASCII.txt` file +- And actually I'm not entirely sure about the order of filtering before tokenizing, etc... 
+- Ah, I see that `charFilter` must be before the tokenizer because it works on a stream, whereas `filter` operates on tokenized input so it must come after the tokenizer diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html index b09ca0057..834e1580b 100644 --- a/docs/2018-05/index.html +++ b/docs/2018-05/index.html @@ -27,7 +27,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked - + @@ -65,9 +65,9 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked "@type": "BlogPosting", "headline": "May, 2018", "url": "https://alanorth.github.io/cgspace-notes/2018-05/", - "wordCount": "2164", + "wordCount": "2267", "datePublished": "2018-05-01T16:43:54+03:00", - "dateModified": "2018-05-17T09:45:45+03:00", + "dateModified": "2018-05-17T10:51:46+03:00", "author": { "@type": "Person", "name": "Alan Orth" @@ -461,7 +461,14 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
  • Testing reconciliation of countries against Solr via conciliator, I notice that CÔTE D'IVOIRE doesn’t match COTE D'IVOIRE, whereas with reconcile-csv it does
  • Also, when reconciling regions against Solr via conciliator EASTERN AFRICA doesn’t match EAST AFRICA, whereas with reconcile-csv it does
  • And SOUTH AMERICA matches both SOUTH ASIA and SOUTH AMERICA with the same match score of 2… WTF.
  • -
  • It could be that I just need to tune the index or query filters in Solr (currently using the example text_en field type)
  • +
  • It could be that I just need to tune the query filter in Solr (currently using the example text_en field type)
  • +
  • Oh sweet, it turns out that the issue with searching for characters with accents is called “ASCII folding” in Solr
  • +
  • You can use either a solr.ASCIIFoldingFilterFactory filter or a solr.MappingCharFilterFactory charFilter mapping against mapping-FoldToASCII.txt
  • +
  • Also see: https://opensourceconnections.com/blog/2017/02/20/solr-utf8/
  • +
  • Now CÔTE D'IVOIRE matches COTE D'IVOIRE!
  • +
  • I’m not sure which method is better, perhaps the solr.ASCIIFoldingFilterFactory filter because it doesn’t require copying the mapping-FoldToASCII.txt file
  • +
  • And actually I’m not entirely sure about the order of filtering before tokenizing, etc…
  • +
  • Ah, I see that charFilter must be before the tokenizer because it works on a stream, whereas filter operates on tokenized input so it must come after the tokenizer
  • diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 6aed80250..5cee460f8 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -4,7 +4,7 @@ https://alanorth.github.io/cgspace-notes/2018-05/ - 2018-05-17T09:45:45+03:00 + 2018-05-17T10:51:46+03:00 @@ -164,7 +164,7 @@ https://alanorth.github.io/cgspace-notes/ - 2018-05-17T09:45:45+03:00 + 2018-05-17T10:51:46+03:00 0 @@ -175,7 +175,7 @@ https://alanorth.github.io/cgspace-notes/tags/notes/ - 2018-05-17T09:45:45+03:00 + 2018-05-17T10:51:46+03:00 0 @@ -187,13 +187,13 @@ https://alanorth.github.io/cgspace-notes/posts/ - 2018-05-17T09:45:45+03:00 + 2018-05-17T10:51:46+03:00 0 https://alanorth.github.io/cgspace-notes/tags/ - 2018-05-17T09:45:45+03:00 + 2018-05-17T10:51:46+03:00 0
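The notes in this commit describe two places where accent folding can happen: on the raw character stream before tokenizing (a `charFilter`) or on each token after tokenizing (a `filter`). The sketch below illustrates that ordering in plain Python. It is not Solr's implementation: `fold_to_ascii`, `analyze_with_char_filter`, and `analyze_with_token_filter` are hypothetical names, whitespace splitting stands in for a real tokenizer, and Unicode decomposition only approximates what `mapping-FoldToASCII.txt` does, so it likely misses characters that have no decomposed form.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate accent folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze_with_char_filter(text: str) -> list[str]:
    # charFilter-style: fold the raw character stream first, then tokenize.
    # Splitting on whitespace is an assumption standing in for a tokenizer.
    return fold_to_ascii(text).lower().split()


def analyze_with_token_filter(text: str) -> list[str]:
    # filter-style: tokenize first, then fold each token individually.
    return [fold_to_ascii(token) for token in text.lower().split()]


if __name__ == "__main__":
    indexed = analyze_with_char_filter("CÔTE D'IVOIRE")
    queried = analyze_with_token_filter("COTE D'IVOIRE")
    print(indexed, queried, indexed == queried)
```

For this input both orderings produce the same tokens, which is consistent with the note that either approach made `CÔTE D'IVOIRE` match `COTE D'IVOIRE`. The two can differ when folding changes where a tokenizer would split, which is one reason the position relative to the tokenizer matters.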