Changing Bits: SuggestStopFilter carefully removes stop words for suggesters

Wednesday, August 14, 2013

Lucene now has a nice set of suggesters that use an analyzer to tokenize the suggestions: AnalyzingSuggester, FuzzySuggester and AnalyzingInfixSuggester. Using an analyzer is powerful because it lets you customize exactly how suggestions are matched: you can normalize case, apply stemming, match across different synonym forms, etc.

One of the most common things you'll do with your analyzer is to remove stop words using StopFilter. Unfortunately, if you try this, you'll quickly notice that the stop filter is too aggressive: it happily removes the last token even if the user isn't done typing it yet. For example, if the user has typed "a", you'd expect suggestions like apple, aardvark, etc., but you won't get them because StopFilter removed the "a" token.

You could try using StopFilter only while indexing, which was my first attempt with the suggestions at jirasearch.mikemccandless.com, but then, at least for AnalyzingInfixSuggester, you'll fail to get matches when you pass allTermsRequired=true because the suggester then requires that even stop words find matches.

Finally, you could use the new SuggestStopFilter at lookup time: this filter is just like StopFilter, except that when the token is the very last token, it checks the offset for that token, and if the offset indicates that the token has ended without any further non-token characters, then the token is preserved. The token is also marked as a keyword, so that any later stem filters won't change it. This way a query "a" can find "apple", but a query "a " (with a trailing space) will find nothing because the "a" will be removed.
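To make the last-token rule concrete, here is a minimal plain-Java sketch of the decision. It is a toy model under stated assumptions, not the Lucene implementation: the whitespace tokenizer, the stop word list, and the names SuggestStopSketch, plainStop and suggestStop are all made up for illustration, and the keyword marking is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of the rule described above; this is NOT Lucene's SuggestStopFilter.
class SuggestStopSketch {
    // Hypothetical stop word list, for illustration only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "at", "of", "the"));

    // A token's text plus the offset just past its last character.
    static final class Token {
        final String text;
        final int endOffset;

        Token(String text, int endOffset) {
            this.text = text;
            this.endOffset = endOffset;
        }
    }

    // Assumed tokenizer: split on whitespace, lowercase, record end offsets.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            tokens.add(new Token(input.substring(start, i).toLowerCase(Locale.ROOT), i));
        }
        return tokens;
    }

    // Plain stop filtering: every stop word is dropped, wherever it appears.
    static List<String> plainStop(String input) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokenize(input)) {
            if (!STOP_WORDS.contains(t.text)) {
                kept.add(t.text);
            }
        }
        return kept;
    }

    // Suggest-aware filtering: a stop word survives only if it is the last
    // token and the input ends exactly where that token ends.
    static List<String> suggestStop(String input) {
        List<Token> tokens = tokenize(input);
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            boolean isLast = i == tokens.size() - 1;
            boolean stillTyping = isLast && t.endOffset == input.length();
            if (!STOP_WORDS.contains(t.text) || stillTyping) {
                kept.add(t.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(plainStop("a"));       // []
        System.out.println(suggestStop("a"));     // [a]
        System.out.println(suggestStop("a "));    // []
        System.out.println(suggestStop("the a")); // [a]
    }
}
```

The only difference between the two methods is the stillTyping check, which compares the last token's end offset with the end of the input. That comparison is why a trailing space is enough to turn "a" back into an ordinary stop word.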

I've pushed SuggestStopFilter to jirasearch.mikemccandless.com and it seems to be working well so far!

10 comments:

Artem Lukanin, August 14, 2013 at 3:02 PM
Great! I've created a similar filter, but without offsets (so I don't know whether there is a space after the word). The problem was that there are some expressions that have a lot of stop words before normal words, like "think about to be or not to be a Solr committer". If there is such a document and you try to type in this expression from the beginning to the end, you get very strange suggestions until you type "Solr". I tested it with an AnalyzingSuggester in Solr. I hope I can test your new filter when there is a factory for AnalyzingInfixSuggester in Solr.
Dmitry Kan, August 15, 2013 at 3:44 AM
This sounds like a great addition, Mike!

I have long wondered whether it is possible to keep stop words inside a token n-gram sequence and remove them at the boundaries. Would you have any ideas in this area too?

For reference, the question on stackoverflow (suggestions done with other means, but the principle remains):

http://stackoverflow.com/questions/4954735/autocomplete-via-shingles-and-termvector-component

Michael McCandless, August 15, 2013 at 2:31 PM
Hi Dmitry,

Couldn't you make a custom stop filter that only removed stop words at the start (first token(s) seen) or end of the input (no non-stopword tokens seen after)? It'd require some buffering / state keeping (captureState/restoreState), but it seems doable?
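
As a rough illustration of that idea (a plain-Java sketch, not a Lucene TokenFilter): the code below trims stop words only from the two ends of an already-tokenized input and leaves interior ones alone. The stop word list and the names BoundaryStopSketch and trimBoundaryStopWords are hypothetical. A real streaming filter cannot see the whole list up front, so it would have to buffer pending stop words until it learns whether a non-stop-word follows, which is where the state capturing would come in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: works on a complete token list rather than a token stream.
class BoundaryStopSketch {
    // Hypothetical stop word list, for illustration only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "be", "not", "of", "or", "the", "to"));

    // Drops leading and trailing stop words; interior stop words are kept.
    static List<String> trimBoundaryStopWords(List<String> tokens) {
        int start = 0;
        // Skip stop words seen before the first non-stop-word.
        while (start < tokens.size() && STOP_WORDS.contains(tokens.get(start))) {
            start++;
        }
        int end = tokens.size();
        // Skip stop words that have no non-stop-word after them.
        while (end > start && STOP_WORDS.contains(tokens.get(end - 1))) {
            end--;
        }
        return new ArrayList<>(tokens.subList(start, end));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "state", "of", "the", "art", "of");
        System.out.println(trimBoundaryStopWords(tokens)); // [state, of, the, art]
    }
}
```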
J.L. Hill, May 9, 2014 at 12:38 PM
I assumed this was in the standard distribution of Solr 4.5+, but no? I tried
<filter class="org.apache.lucene.search.suggest.analyzing.SuggestStopFilter"
but I get a plugin init failure. The filter looks quite helpful. Thanks.
Unknown, January 8, 2015 at 12:33 PM
Hi Michael,

Thanks for the very informative post.
Do we have any config, analyzer, or filter in Solr to stop suggesting phrases that end with stop words?

For example, if the suggestions are as below for the query http://localhost/solr/suggest?q=jazz+a
"suggestion": [
"jazz and",
"jazz at",
"jazz at lincoln",
"jazz at lincoln center",
"jazz artists",
"jazz and classic"
]
Is there any config or solution to remove only the "jazz at" and "jazz and" phrases, so that the final suggestion response looks more sensible?

"suggestion": [
"jazz at lincoln",
"jazz at lincoln center",
"jazz artists",
"jazz and classic"
]

Google does this intelligently :)

I have tested with StopFilterFactory and SuggestStopFilter, neither of which does this.

Do I have to come up with a custom plugin to do this in Solr?

Thanks,
Rajesh.