Adding PorterStemFilter and FuzzyQuery to all searches - Front- & Back-End Development - Progress Community

Adding PorterStemFilter and FuzzyQuery to all searches

  • I've followed the guidance for using ASCIIFoldingFilter, but I want to use the PorterStemFilter instead. In theory, this should let me search for things like "tester" and get all results for "test", provided the text is tokenized correctly during indexing and during the search. When debugging, I see my CustomLuceneAnalyzer code being hit during both indexing and search, but nothing changes in the results...

    Once I get that working, I'd like to additionally find a way to use Telerik.Sitefinity.Utilities.Lucene.Net.Search.FuzzyQuery instead of the default(?) MultiTermQuery class. Is that something that's possible via Global.asax class overriding?

    I'm currently using SF 9.2 and this article as the base of my Custom Analyzer testing:

    The main difference from that article is the tokenizer: the Lucene documentation said to use a lowercasing tokenizer, and I found that one (LowerCaseTokenizer) was available within SF. Switching between StandardTokenizer and LowerCaseTokenizer didn't change the results, however.

                TokenStream stream = new LowerCaseTokenizer(reader);
                stream = new PorterStemFilter(stream);
                stream = new StandardFilter(stream);
                return stream;
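
The chain above wraps a tokenizer in successive filters, each stage consuming the output of the one before it. As a language-neutral illustration of that composition, here is a minimal Python sketch. The names `lowercase_tokenizer`, `stem_filter`, `toy_stem`, and `analyze` are hypothetical stand-ins, and the suffix stripper is a toy, not the Porter algorithm or any Lucene class.

```python
import re
from typing import Iterable, Iterator, List


def lowercase_tokenizer(text: str) -> Iterator[str]:
    """Emit lowercased runs of letters, roughly what a lowercasing tokenizer does."""
    for match in re.finditer(r"[^\W\d_]+", text):
        yield match.group().lower()


def toy_stem(token: str) -> str:
    """Toy suffix stripper (assumption); this is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap an upstream token stream and stem each token as it passes through."""
    for token in tokens:
        yield toy_stem(token)


def analyze(text: str) -> List[str]:
    stream = lowercase_tokenizer(text)  # stage 1: tokenize and lowercase
    stream = stem_filter(stream)        # stage 2: stem the lowercased tokens
    return list(stream)


print(analyze("Ranched, RANCHING ranch"))  # ['ranch', 'ranch', 'ranch']
```

The point of the sketch is only the ordering: the stemmer sees tokens that are already lowercased, because it sits downstream of the tokenizer.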

    Is this something that's ever been done before? If not, and in a completely different direction, how would someone get these two specific search capabilities in SF?

    Let me know if there's any more info you need to replicate this. Thanks!

  • Okay, I've also confirmed that even though the TokenStream code is being hit during indexing and searching, nothing happens with the basic ASCIIFoldingFilter example either: after indexing, I get 0 results for 'tëst' (vs. 232 for 'test') and for any other variation with accented characters. So I must be doing something fundamentally wrong. I've also tried setting overridesTokenStreamMethod = true in my constructor, in case that variable needed to be set for things to be handled correctly. I'm about to spin up a fresh copy of 9.2 and try again, but since the original example isn't working, I'm losing hope of being able to move forward without official help.

  • Okay, day 3 now. Still moving forward. I've set up a brand new 9.2 instance with a couple of simplified test pages containing only words like "ranched ranching" but not "ranch", and some with "tést", "tésting", etc. And behold, using the original code provided, the ASCIIFoldingFilter showed me the "tést" page when I searched for "test"!

    I then got HALF of the PorterStemFilter working as expected. Based on repeated testing it's definitely stemming the words correctly when indexing. It's NOT stemming the actual search terms, though. So searching for "ranch" gets pages with "ranching" and "ranched", but now searching for "ranching" shows 0 results even though that's one of the terms directly on a page (which shows it must have been indexed as "ranch").
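
The behaviour described here is what an inverted index produces when the analyzer runs at index time but not at query time. The following Python sketch reproduces it under stated assumptions: `toy_stem` is a toy suffix stripper rather than the Porter algorithm, and `build_index` and `search` are hypothetical helpers, not Sitefinity or Lucene APIs.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

Analyzer = Callable[[str], List[str]]


def toy_stem(token: str) -> str:
    """Toy suffix stripper (assumption); NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stemming_analyzer(text: str) -> List[str]:
    return [toy_stem(t) for t in text.lower().split()]


def raw_analyzer(text: str) -> List[str]:
    return text.lower().split()


def build_index(pages: Dict[str, str], analyze: Analyzer) -> Dict[str, Set[str]]:
    """Map each analyzed term to the set of pages that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page, text in pages.items():
        for term in analyze(text):
            index[term].add(page)
    return index


def search(index: Dict[str, Set[str]], query: str, analyze: Analyzer) -> Set[str]:
    """Return pages matching any analyzed query term (exact term lookup)."""
    hits: Set[str] = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


index = build_index({"page1": "ranched ranching"}, stemming_analyzer)

print(search(index, "ranch", raw_analyzer))          # {'page1'}
print(search(index, "ranching", raw_analyzer))       # set()
print(search(index, "ranching", stemming_analyzer))  # {'page1'}
```

Only the term "ranch" exists in the index, so an unstemmed query for "ranching" has nothing to match; running the same analyzer over the query restores the hit.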

    So I went back and checked, and the ASCIIFoldingFilter does the exact same thing. If I have "tést" on a page, I can search for "test" and find it, but searching for "tést" directly gives no results. At some point since the article I'm following was written, the Search Query tokenizer and the Search Indexing tokenizer must have been split apart...
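
The same asymmetry can be shown for accent folding. This Python sketch approximates folding with Unicode decomposition; `fold_accents` is a hypothetical helper and covers fewer characters than Lucene's ASCIIFoldingFilter (for example, it does not map "ß" or "ø").

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks after NFKD decomposition (an approximation of ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Index time: the page text is folded, so only "test" is stored.
indexed_terms = {fold_accents("tést")}

# Query time without folding: the accented query term is not in the index.
print("tést" in indexed_terms)                # False

# Query time with the same folding applied: the term matches.
print(fold_accents("tést") in indexed_terms)  # True
```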

    I mean, I think at this point I might be able to move forward with doing some hacky search term replacement via custom Regex stemming and fuzzy matching algorithms (loading up potential misspellings and stemmed copies of search terms onto the end of the search query manually), but it's going to be ugly. If you guys could point me to where I can apply these Lucene filters to the search query terms list programmatically in 9.2, that would be awesome.
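
A rough Python sketch of the term-replacement workaround described above: stemmed copies of each search term are appended to the query before it is submitted. It assumes the search treats space-separated terms as alternatives (OR), which is not confirmed for Sitefinity; `regex_stem` and `expand_query` are hypothetical names, and the regex is a toy suffix rule, not Porter stemming.

```python
import re

# Toy rule (assumption): a stem of at least three characters followed by a common suffix.
_SUFFIX_RULE = re.compile(r"(\w{3,}?)(?:ing|ed|s)")


def regex_stem(term: str) -> str:
    """Return the term with one trailing suffix removed, or unchanged if no rule applies."""
    match = _SUFFIX_RULE.fullmatch(term)
    return match.group(1) if match else term


def expand_query(query: str) -> str:
    """Append the stemmed form of each term that is not already in the query."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        stemmed = regex_stem(term)
        if stemmed not in expanded:
            expanded.append(stemmed)
    return " ".join(expanded)


print(expand_query("ranching tests"))  # ranching tests ranch test
```

Because the index already holds stemmed terms, adding the stemmed copy is what lets a query such as "ranching" reach the pages indexed under "ranch".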

    Same goes for any samples of using the FuzzyQuery at either the time of indexing OR the parsing of search query terms. I have a feeling only one of those hooks might be necessary to get past misspellings, but I'd rather start with some code you guys have seen working at least once in the past.
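
On the question of which hook is needed: in stock Lucene, FuzzyQuery is a query-time construct that matches indexed terms lying within an edit-distance-based similarity of the query term, so nothing fuzzy has to happen during indexing. The stock query parser also turns the syntax `term~` into a FuzzyQuery; whether Sitefinity 9.2 passes that syntax through is not confirmed here. The Python sketch below illustrates the matching idea only. `levenshtein` and `fuzzy_terms` are hypothetical helpers, and the plain `max_edits` threshold is an assumption rather than Lucene's own similarity formula.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_terms(query_term: str, indexed_terms: Iterable[str], max_edits: int = 1) -> List[str]:
    """Return the indexed terms within max_edits of the (possibly misspelled) query term."""
    return [t for t in indexed_terms if levenshtein(query_term, t) <= max_edits]


indexed = ["ranch", "test", "branch"]
print(fuzzy_terms("ranxh", indexed))               # ['ranch']
print(fuzzy_terms("ranxh", indexed, max_edits=2))  # ['ranch', 'branch']
```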