diff --git a/content/posts/2018-05.md b/content/posts/2018-05.md
index b7cc94c8e..d884686a1 100644
--- a/content/posts/2018-05.md
+++ b/content/posts/2018-05.md
@@ -278,4 +278,11 @@ ga('send', 'pageview', {
 - Testing reconciliation of countries against Solr via conciliator, I notice that `CÔTE D'IVOIRE` doesn't match `COTE D'IVOIRE`, whereas with reconcile-csv it does
 - Also, when reconciling regions against Solr via conciliator `EASTERN AFRICA` doesn't match `EAST AFRICA`, whereas with reconcile-csv it does
 - And `SOUTH AMERICA` matches both `SOUTH ASIA` and `SOUTH AMERICA` with the same match score of 2... WTF.
-- It could be that I just need to tune the index or query filters in Solr (currently using the example `text_en` field type)
+- It could be that I just need to tune the query filter in Solr (currently using the example `text_en` field type)
+- Oh sweet, it turns out that the issue with searching for characters with accents is called "ASCII folding" in Solr
+- You can use either a [`solr.ASCIIFoldingFilterFactory` filter](https://lucene.apache.org/solr/guide/7_3/language-analysis.html) or a [`solr.MappingCharFilterFactory` charFilter](https://lucene.apache.org/solr/guide/7_3/charfilterfactories.html) mapping against `mapping-FoldToASCII.txt`
+- Also see: https://opensourceconnections.com/blog/2017/02/20/solr-utf8/
+- Now `CÔTE D'IVOIRE` matches `COTE D'IVOIRE`!
+- I'm not sure which method is better, perhaps the `solr.ASCIIFoldingFilterFactory` filter because it doesn't require copying the `mapping-FoldToASCII.txt` file
+- And actually I'm not entirely sure about the order of filtering before tokenizing, etc...
+- Ah, I see that `charFilter` must be before the tokenizer because it works on a stream, whereas `filter` operates on tokenized input so it must come after the tokenizer
diff --git a/docs/2018-05/index.html b/docs/2018-05/index.html
index b09ca0057..834e1580b 100644
--- a/docs/2018-05/index.html
+++ b/docs/2018-05/index.html
@@ -27,7 +27,7 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
-
+
@@ -65,9 +65,9 @@ Also, I switched it to use OpenJDK instead of Oracle Java, as well as re-worked
   "@type": "BlogPosting",
   "headline": "May, 2018",
   "url": "https://alanorth.github.io/cgspace-notes/2018-05/",
-  "wordCount": "2164",
+  "wordCount": "2267",
   "datePublished": "2018-05-01T16:43:54+03:00",
-  "dateModified": "2018-05-17T09:45:45+03:00",
+  "dateModified": "2018-05-17T10:51:46+03:00",
   "author": {
     "@type": "Person",
     "name": "Alan Orth"
@@ -461,7 +461,14 @@ $ ./bin/post -c countries ~/src/git/DSpace/2018-05-10-countries.csv
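 Testing reconciliation of countries against Solr via conciliator, I notice that CÔTE D'IVOIRE doesn’t match COTE D'IVOIRE, whereas with reconcile-csv it does
 Also, when reconciling regions against Solr via conciliator EASTERN AFRICA doesn’t match EAST AFRICA, whereas with reconcile-csv it does
 And SOUTH AMERICA matches both SOUTH ASIA and SOUTH AMERICA with the same match score of 2… WTF.
-It could be that I just need to tune the index or query filters in Solr (currently using the example text_en field type)
+It could be that I just need to tune the query filter in Solr (currently using the example text_en field type)
+Oh sweet, it turns out that the issue with searching for characters with accents is called “ASCII folding” in Solr
+You can use either a solr.ASCIIFoldingFilterFactory filter or a solr.MappingCharFilterFactory charFilter mapping against mapping-FoldToASCII.txt
+Also see: https://opensourceconnections.com/blog/2017/02/20/solr-utf8/
+Now CÔTE D'IVOIRE matches COTE D'IVOIRE!
+I’m not sure which method is better, perhaps the solr.ASCIIFoldingFilterFactory filter because it doesn’t require copying the mapping-FoldToASCII.txt file
+And actually I’m not entirely sure about the order of filtering before tokenizing, etc…
+Ah, I see that charFilter must be before the tokenizer because it works on a stream, whereas filter operates on tokenized input so it must come after the tokenizer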
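
To make the ordering point from the notes concrete, here is a rough Python sketch of the two places folding can happen in an analysis chain. It is not Solr code. The function names (`fold_to_ascii`, `tokenize`, `analyze_with_char_filter`, `analyze_with_token_filter`) are made up for illustration, the tokenizer is a plain whitespace split, and the folding is only an approximation based on Unicode decomposition. As far as I know, the real `solr.ASCIIFoldingFilterFactory` uses a hand-built mapping table that also covers characters with no decomposition (for example `ø` or `ß`), which this sketch would leave untouched.

```python
import unicodedata


def fold_to_ascii(text):
    """Approximate accent folding: decompose, then drop combining marks.

    Assumption: this only handles characters that decompose into a base
    letter plus combining marks; it is not Solr's actual mapping table.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Hypothetical stand-in for a tokenizer: split on whitespace."""
    return text.split()


def analyze_with_char_filter(text):
    # charFilter position: folding sees the raw character stream,
    # so it has to run before the tokenizer.
    return tokenize(fold_to_ascii(text))


def analyze_with_token_filter(text):
    # filter position: folding sees individual tokens,
    # so it has to run after the tokenizer.
    return [fold_to_ascii(token) for token in tokenize(text)]


if __name__ == "__main__":
    indexed = "COTE D'IVOIRE"
    query = "CÔTE D'IVOIRE"
    print(analyze_with_char_filter(query))
    print(analyze_with_token_filter(query))
    # With either placement the accented query yields the same tokens
    # as the unaccented value.
    print(analyze_with_char_filter(query) == tokenize(indexed))
```

For this simple input both placements give identical tokens, which fits the observation that either method makes `CÔTE D'IVOIRE` match `COTE D'IVOIRE`. The difference is only where in the chain the folding sits.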