Wednesday, August 14, 2013

SuggestStopFilter carefully removes stop words for suggesters

Lucene now has a nice set of suggesters that use an analyzer to tokenize the suggestions: AnalyzingSuggester, FuzzySuggester and AnalyzingInfixSuggester. Using an analyzer is powerful because it lets you customize exactly how suggestions are matched: you can normalize case, apply stemming, match across different synonym forms, etc.

One of the most common things you'll do with your analyzer is to remove stop words using StopFilter. Unfortunately, if you try this, you'll quickly notice that the stop filter is too aggressive: it happily removes the last token even if the user isn't done typing it yet. For example, if the user has typed "a", you'd expect suggestions like apple, aardvark, etc., but you won't get them because StopFilter removed the "a" token.

You could try using StopFilter only while indexing, which was my first attempt with the suggestions at jirasearch.mikemccandless.com, but then, at least for AnalyzingInfixSuggester, you'll fail to get matches when you pass allTermsRequired=true because the suggester then requires that even stop words find matches.
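To make that failure concrete, here is a minimal plain-Java sketch of the mismatch. It is not Lucene code: the class name, the stop-word list, and the whitespace tokenization are all assumptions made for illustration, and the matching rule (earlier query tokens must match exactly, the last one as a prefix) is a simplification of what an infix suggester does when all terms are required.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustration only: stop words removed at index time but not at lookup time. */
final class AllTermsRequiredSketch {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "at", "the", "of");

    /** Lowercases, splits on whitespace, and optionally drops stop words. */
    static List<String> tokenize(String text, boolean removeStopWords) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).trim().split("\\s+"))
                .filter(t -> !t.isEmpty())
                .filter(t -> !removeStopWords || !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    /** Every query token must match: earlier tokens exactly, the last one as a prefix. */
    static boolean matches(String suggestion, String query) {
        List<String> indexed = tokenize(suggestion, true);   // stop words removed while indexing
        List<String> queryTokens = tokenize(query, false);   // but kept at lookup time
        for (int i = 0; i < queryTokens.size(); i++) {
            String q = queryTokens.get(i);
            boolean isLast = i == queryTokens.size() - 1;
            boolean found = isLast
                    ? indexed.stream().anyMatch(t -> t.startsWith(q))
                    : indexed.contains(q);
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String suggestion = "jazz at lincoln center";
        System.out.println(matches(suggestion, "jazz l"));     // true
        System.out.println(matches(suggestion, "jazz at l"));  // false: "at" was never indexed
    }
}
```

In this sketch the query "jazz at l" fails only because the required term "at" has no indexed counterpart, which is the symptom described above.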

Finally, you could use the new SuggestStopFilter at lookup time: this filter is just like StopFilter except when the token is the very last token. In that case it checks the offset for that token, and if the offset indicates that the token has ended without any further non-token characters, then the token is preserved. The token is also marked as a keyword, so that any later stem filters won't change it. This way a query "a" can find "apple", but a query "a " (with a trailing space) will find nothing because the "a" will be removed.
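The rule for the last token can be sketched in plain Java. This is a simplified simulation, not the Lucene implementation: the class name, the stop-word list, and the whitespace tokenizer are assumptions for illustration, and the keyword marking that protects the preserved token from later stemming is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustration only: drop stop words unless the user may still be typing the token. */
final class SuggestStopSketch {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "at", "the", "of");

    static List<String> filter(String input) {
        List<String> kept = new ArrayList<>();
        int n = input.length();
        int pos = 0;
        while (pos < n) {
            // Skip whitespace between tokens.
            while (pos < n && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos >= n) {
                break;
            }
            int start = pos;
            while (pos < n && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            int end = pos; // end offset of this token
            String token = input.substring(start, end).toLowerCase(Locale.ROOT);
            // A token whose end offset equals the input length is the last token
            // with nothing typed after it, so the user may not be done with it yet.
            boolean maybeStillTyping = end == n;
            if (!STOP_WORDS.contains(token) || maybeStillTyping) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("a"));         // [a]
        System.out.println(filter("a "));        // []
        System.out.println(filter("jazz at"));   // [jazz, at]
        System.out.println(filter("jazz at l")); // [jazz, l]
    }
}
```

Comparing the token's end offset with the end of the input is what distinguishes "a" from "a " here: the trailing space means the token is finished, so it is treated like any other stop word.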

I've pushed SuggestStopFilter to jirasearch.mikemccandless.com and it seems to be working well so far!

10 comments:

  1. Great! I've created a similar filter, but without offsets (so I don't know whether there is a space after the word). The problem was that some expressions have a lot of stop words before the normal words, like "think about to be or not to be a Solr committer". If there is such a document and you try to type in this expression from beginning to end, you get very strange suggestions until you type "Solr". I tested it with an AnalyzingSuggester in Solr. I hope I can test your new filter once there is a factory for AnalyzingInfixSuggester in Solr.

  2. This sounds like a great addition, Mike!

    I have long wondered whether it is possible to keep stop words inside a token n-gram sequence and remove them only at the boundaries. Would you have any ideas in this area too?

    For reference, the question on stackoverflow (suggestions done with other means, but the principle remains):

    http://stackoverflow.com/questions/4954735/autocomplete-via-shingles-and-termvector-component

  3. Hi Dmitry,

    Couldn't you make a custom stop filter that only removes stop words at the start (first token(s) seen) or end of the input (no non-stopword tokens seen after)? It'd require some buffering / state keeping (captureState/restoreState) but it seems doable?

    Replies
    1. Hi Mike,

      That sounds like a reasonable idea. Thanks!

  4. I assumed this was in the standard distribution of Solr4.5+, but no? I tried
    <filter class="org.apache.lucene.search.suggest.analyzing.SuggestStopFilter"
    but I get a plugin init failure. The filter looks quite helpful. Thanks.

    Replies
    1. Hmm unfortunately it looks like this hasn't been exposed through Solr / as a factory. Maybe open an issue?

  5. Hi Michael,

    Thanks for the very informative post.
    Is there any analyzer or filter configuration in Solr that stops it from suggesting phrases that end with stop words?

    For ex:
    If the suggestion are as below for query http://localhost/solr/suggest?q=jazz+a
    "suggestion": [
    "jazz and",
    "jazz at",
    "jazz at lincoln",
    "jazz at lincoln center",
    "jazz artists",
    "jazz and classic"
    ]
    Is there any config or solution to remove only the "jazz at" and "jazz and" phrases, so that the final suggestion response looks more sensible?

    "suggestion": [
    "jazz at lincoln",
    "jazz at lincoln center",
    "jazz artists",
    "jazz and classic"
    ]

    Google does this intelligently :)

    I have tested with StopFilterFactory and SuggestStopFilter, neither of which does this.

    Do I have to come up with a custom plugin to do this in Solr?


    Thanks,
    Rajesh.

    Replies
    1. Which suggester are you using? Free text?

    2. Hello Michael, Rajesh. I had a similar requirement to remove trailing stopwords from a shingle-based suggester, so I implemented a filter that does just that. You might want to have a look at https://github.com/spyk/shingle-stop-filter. Thanks!
