Wednesday, August 14, 2013

SuggestStopFilter carefully removes stop words for suggesters

Lucene now has a nice set of suggesters that use an analyzer to tokenize the suggestions: AnalyzingSuggester, FuzzySuggester and AnalyzingInfixSuggester. Using an analyzer is powerful because it lets you customize exactly how suggestions are matched: you can normalize case, apply stemming, match across different synonym forms, etc.

One of the most common things you'll do with your analyzer is to remove stop words using StopFilter. Unfortunately, if you try this, you'll quickly notice that the stop filter is too aggressive: it happily removes the last token even if the user isn't done typing it yet. For example, if the user has typed "a", you'd expect suggestions like apple, aardvark, etc., but you won't get them because StopFilter removed the "a" token.

You could try using StopFilter only while indexing, which was my first attempt with the suggestions at jirasearch.mikemccandless.com, but then, at least for AnalyzingInfixSuggester, you'll fail to get matches when you pass allTermsRequired=true because the suggester then requires that even stop words find matches.

Finally, you could use the new SuggestStopFilter at lookup time. This filter behaves just like StopFilter except for the very last token: it checks that token's end offset, and if the offset shows that the token ended at the end of the input, with no further non-token characters after it, the token is preserved even if it is a stop word. The preserved token is also marked as a keyword, so that any later stem filters won't change it. This way a query "a" can find "apple", but a query "a " (with a trailing space) will find nothing because the "a" will be removed.
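To make the last-token rule concrete, here is a minimal, self-contained sketch of the idea in plain Java. It does not use Lucene's TokenStream API and is not the actual filter implementation: the class SuggestStopSketch, its Token helper, the simple letter-or-digit tokenizer and the tiny stop word list are all hypothetical names and assumptions made up for illustration. It also does not model the keyword marking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not Lucene's implementation.
class SuggestStopSketch {

    // A token's text plus the offset in the input where the token ends.
    static final class Token {
        final String text;
        final int endOffset;
        Token(String text, int endOffset) {
            this.text = text;
            this.endOffset = endOffset;
        }
    }

    // Assumed tokenizer: runs of letters or digits, lowercased.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            while (i < input.length() && !Character.isLetterOrDigit(input.charAt(i))) {
                i++; // skip non-token characters
            }
            int start = i;
            while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(input.substring(start, i).toLowerCase(Locale.ROOT), i));
            }
        }
        return tokens;
    }

    // Drops stop words, except a final token that ends exactly at the end
    // of the input, since the user may still be typing it.
    static List<String> analyze(String input, Set<String> stopWords) {
        List<Token> tokens = tokenize(input);
        List<String> kept = new ArrayList<>();
        for (int n = 0; n < tokens.size(); n++) {
            Token t = tokens.get(n);
            boolean isLast = n == tokens.size() - 1;
            boolean stillTyping = isLast && t.endOffset == input.length();
            if (!stopWords.contains(t.text) || stillTyping) {
                kept.add(t.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("a", "the", "of"));
        System.out.println(analyze("a", stop));            // [a]
        System.out.println(analyze("a ", stop));           // []
        System.out.println(analyze("the king of", stop));  // [king, of]
        System.out.println(analyze("the king of ", stop)); // [king]
    }
}
```

The only signal the sketch relies on is the comparison between the last token's end offset and the input length: a trailing space or punctuation moves the end of the input past the token, so the stop word is treated as finished and removed.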

I've pushed SuggestStopFilter to jirasearch.mikemccandless.com and it seems to be working well so far!

5 comments:

  1. Great! I've created a similar filter, but without offsets (so I don't know whether there is a space after the word). The problem was that some expressions have a lot of stop words before the normal words, like "think about to be or not to be a Solr committer". If there is such a document and you type in this expression from beginning to end, you get very strange suggestions until you type "Solr". I tested it with an AnalyzingSuggester in Solr. I hope I can test your new filter once there is a factory for AnalyzingInfixSuggester in Solr.

  2. This sounds like a great addition, Mike!

    I have long wondered whether it is possible to keep stop words inside a token n-gram sequence and remove them only at the boundaries. Would you have any ideas in this area too?

    For reference, the question on stackoverflow (suggestions done with other means, but the principle remains):

    http://stackoverflow.com/questions/4954735/autocomplete-via-shingles-and-termvector-component

  3. Hi Dmitry,

    Couldn't you make a custom stop filter that only removes stop words at the start (first token(s) seen) or end of the input (no non-stop-word tokens seen after)? It'd require some buffering / state keeping (captureState/restoreState) but it seems doable?

    1. Hi Mike,

      That sounds like a reasonable idea. Thanks!