Update notes

2025-01-27 05:49:12 +01:00 · 2018-05-17 12:37:21 +03:00
parent bccb8b20fe
commit 232693d9d3
3 changed files with 24 additions and 10 deletions
--- a/content/posts/2018-05.md
+++ b/content/posts/2018-05.md
@@ -278,4 +278,11 @@ ga('send', 'pageview', {
 - Testing reconciliation of countries against Solr via conciliator, I notice that `CÔTE D'IVOIRE` doesn't match `COTE D'IVOIRE`, whereas with reconcile-csv it does
 - Also, when reconciling regions against Solr via conciliator `EASTERN AFRICA` doesn't match `EAST AFRICA`, whereas with reconcile-csv it does
 - And `SOUTH AMERICA` matches both `SOUTH ASIA` and `SOUTH AMERICA` with the same match score of 2... WTF.
- It could be that I just need to tune the index or query filters in Solr (currently using the example `text_en` field type)
+- It could be that I just need to tune the query filter in Solr (currently using the example `text_en` field type)
+- Oh sweet, it turns out that the issue with searching for characters with accents is called "ASCII folding" (character folding) in Solr
+- You can use either a [`solr.ASCIIFoldingFilterFactory` filter](https://lucene.apache.org/solr/guide/7_3/language-analysis.html) or a [`solr.MappingCharFilterFactory` charFilter](https://lucene.apache.org/solr/guide/7_3/charfilterfactories.html) mapping against `mapping-FoldToASCII.txt`
+- Also see: https://opensourceconnections.com/blog/2017/02/20/solr-utf8/
+- Now `CÔTE D'IVOIRE` matches `COTE D'IVOIRE`!
+- I'm not sure which method is better, perhaps the `solr.ASCIIFoldingFilterFactory` filter because it doesn't require copying the `mapping-FoldToASCII.txt` file
+- And actually I'm not entirely sure about the order of filtering before tokenizing, etc...
+- Ah, I see that `charFilter` must be before the tokenizer because it works on a stream, whereas `filter` operates on tokenized input so it must come after the tokenizer
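
The ordering described in the last note can be sketched as a Solr field type. This is a minimal, untested sketch and not the configuration from these notes: the field type name `text_folded` is hypothetical, and the choice of `solr.StandardTokenizerFactory` and `solr.LowerCaseFilterFactory` is an assumption. Both folding options are shown for comparison, though normally only one of them would be enabled.

```xml
<!-- Hypothetical field type; the name and the tokenizer choice are assumptions -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Option 1: charFilter rewrites the raw character stream, so it sits before the tokenizer.
         It needs mapping-FoldToASCII.txt to be present in the core's conf directory. -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>

    <tokenizer class="solr.StandardTokenizerFactory"/>

    <!-- Token filters operate on the tokens the tokenizer emits, so they sit after it -->
    <filter class="solr.LowerCaseFilterFactory"/>

    <!-- Option 2: fold accented characters per token; no mapping file required -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With a single `<analyzer>` element like this, the same chain applies at both index and query time, so `CÔTE D'IVOIRE` and `COTE D'IVOIRE` should be reduced to the same terms on both sides. Documents indexed before the change would need to be reindexed for the folding to take effect.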