Changing Bits: A new Lucene suggester based on infix matches

Saturday, June 22, 2013

Suggest, sometimes called auto-suggest, type-ahead search or auto-complete, has been an essential search feature ever since Google added it almost 5 years ago.

Lucene has a number of implementations; I previously described AnalyzingSuggester. Since then, FuzzySuggester was also added, which extends AnalyzingSuggester by also accepting misspelled inputs.

Here I describe our newest suggester: AnalyzingInfixSuggester, now going through iterations on the LUCENE-4845 Jira issue.

Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.

You can see it in action at the example Jira search application that I built to showcase various Lucene features.

For example, if you enter "japan" you should see various issues suggested, including:

SOLR-4945: Japanese Autocomplete and Highlighter broken
LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji
LUCENE-3921: Add decompose compound Japanese Katakana token capability to Kuromoji

As you can see, the incoming characters can match not just the prefix of each suggestion but also the prefix of any token within.

Unlike the existing suggesters, this new suggester does not use a specialized data structure such as FSTs. Instead, it's an "ordinary" Lucene index under the hood, making use of EdgeNGramTokenFilter to index the short prefixes of each token, up to length 3 by default, for fast prefix querying.
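To make the edge n-gram idea concrete, here is a minimal sketch in plain Python. It is not Lucene's API: the function names are made up for illustration, and simple whitespace splitting stands in for a real analyzer. It only shows how indexing the short leading prefixes of every token turns a short prefix lookup into an exact-term lookup.

```python
from collections import defaultdict

MAX_GRAM = 3  # mirrors the "up to length 3 by default" setting described above


def edge_ngrams(token, max_gram=MAX_GRAM):
    """Return the leading prefixes of token, from length 1 up to max_gram."""
    return [token[:n] for n in range(1, min(len(token), max_gram) + 1)]


def build_prefix_index(suggestions):
    """Map each short prefix to the ids of suggestions having a token with that prefix."""
    index = defaultdict(set)
    for doc_id, text in enumerate(suggestions):
        # Assumption: lowercase + whitespace split stands in for the analyzer.
        for token in text.lower().split():
            for gram in edge_ngrams(token):
                index[gram].add(doc_id)
    return index


suggestions = [
    "Japanese Autocomplete and Highlighter broken",
    "Add Japanese Kanji number normalization to Kuromoji",
    "Fix typo in README",
]
index = build_prefix_index(suggestions)

# A short input such as "jap" is now a single exact-term lookup.
print(sorted(index["jap"]))
```

Inputs longer than the maximum gram length are not covered by this table, so they would presumably need an ordinary prefix query over the full tokens instead.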

It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time, and at lookup time uses a custom Collector to stop after finding the first N matching hits since these hits are the best matches when sorting by weight. The lookup method lets you specify whether all terms must be found, or any of the terms (Jira search requires all terms).
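The early-termination trick can be sketched roughly as follows. This is a plain Python illustration, not the suggester's actual collector: `presort`, `lookup` and the `postings` dictionary are hypothetical names, and the per-term postings are modeled as simple sets of document ids.

```python
def presort(weights):
    """Index-time step: order doc ids from highest to lowest weight."""
    return sorted(range(len(weights)), key=lambda doc_id: -weights[doc_id])


def lookup(doc_order, postings, terms, num, all_terms_required=True):
    """Collect the first num matching docs; they are the best ones by weight."""
    combine = all if all_terms_required else any
    hits = []
    for doc_id in doc_order:
        if combine(doc_id in postings.get(term, set()) for term in terms):
            hits.append(doc_id)
            if len(hits) == num:
                break  # every remaining doc has an equal or lower weight
    return hits


weights = [5, 40, 10, 30]                      # one weight per suggestion
postings = {"jap": {0, 1, 3}, "kan": {1, 2}}   # term -> matching doc ids
doc_order = presort(weights)

print(lookup(doc_order, postings, ["jap"], num=2))
print(lookup(doc_order, postings, ["jap", "kan"], num=2))
print(lookup(doc_order, postings, ["jap", "kan"], num=2, all_terms_required=False))
```

Because documents are visited in weight order, the loop never needs to score or sort the hits; it simply stops once it has enough of them.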

Since the suggestions are sorted solely by weight, with no other relevance criteria, this suggester is a good fit for applications that have a strong a-priori weighting for each suggestion, such as a movie search engine ranking suggestions by each movie's popularity, recency or a blend of the two. In Jira search I rank each suggestion (Jira issue) by how recently it was updated.

Specifically, there is no penalty for suggestions with matching tokens far from the beginning, which could mean the relevance is poor in some cases; an alternative approach (patch is on the issue) uses FSTs instead, which can require that the matched tokens are within the first three tokens, for example. This would also be possible with AnalyzingInfixSuggester using an index-time analyzer that dropped all but the first three tokens.
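As a rough illustration of that last idea, the sketch below keeps only the first three tokens of each suggestion at index time, so matches deeper in the text simply cannot occur. It is plain Python with hypothetical helper names; in Lucene itself, a token-limiting filter such as LimitTokenCountFilter might play this role, but that wiring is an assumption and not something taken from the patch.

```python
def index_tokens(text, max_tokens=3):
    """Index-time analysis: lowercase, split, keep only the first max_tokens tokens."""
    return text.lower().split()[:max_tokens]


def matches(text, prefix, max_tokens=3):
    """True if any of the retained tokens starts with the given prefix."""
    return any(token.startswith(prefix.lower())
               for token in index_tokens(text, max_tokens))


near = "Add Japanese Kanji number normalization to Kuromoji"
far = "Add decompose compound Japanese Katakana token capability to Kuromoji"

print(matches(near, "jap"))  # "japanese" is the 2nd token, so it is retained
print(matches(far, "jap"))   # "japanese" is the 4th token, so it was dropped
```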

One nice benefit of an index-based approach is that AnalyzingInfixSuggester handles highlighting of the matched tokens (red color, above), which has unfortunately proven difficult to provide with the FST-based suggesters. Another benefit is that, in theory, the suggester could support near-real-time indexing, but I haven't exposed that in the current patch and probably won't for some time (patches welcome!).
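Highlighting of this kind can be approximated with a few lines of Python. This is only a sketch of the idea, not the suggester's own highlighting code: the `highlight` function is hypothetical, it splits on whitespace rather than using an analyzer, and it marks matches with `<b>` tags instead of a color.

```python
import html


def highlight(suggestion, query_tokens):
    """Wrap the matched leading part of each token in <b>...</b>."""
    prefixes = [q.lower() for q in query_tokens if q]
    out = []
    for token in suggestion.split():
        lowered = token.lower()
        match = next((p for p in prefixes if lowered.startswith(p)), None)
        if match is None:
            out.append(html.escape(token))
        else:
            n = len(match)
            out.append("<b>" + html.escape(token[:n]) + "</b>" + html.escape(token[n:]))
    return " ".join(out)


print(highlight("Add Japanese Kanji number normalization to Kuromoji", ["japan"]))
# Add <b>Japan</b>ese Kanji number normalization to Kuromoji
```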

Performance is reasonable: somewhere between AnalyzingSuggester and FuzzySuggester, between 58 and 100 kQPS (details on the issue).

Analysis fun

As with AnalyzingSuggester, AnalyzingInfixSuggester lets you separately configure the index-time vs. search-time analyzers. With Jira search, I enabled stop-word removal at index time, but not at search time, so that a query like "or" would still successfully find any suggestions containing words starting with "or", rather than dropping the term entirely.
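The effect of using different analyzers on each side can be sketched as below. This is plain Python with made-up names and a tiny illustrative stop-word list, not Lucene's analyzer classes: the index-time analyzer drops stop words, while the search-time analyzer keeps them so that a partially typed word is never thrown away.

```python
STOP_WORDS = {"a", "an", "and", "or", "the", "to"}  # illustrative list only


def index_analyzer(text):
    """Index time: lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def query_analyzer(text):
    """Search time: lowercase and split, but keep stop words."""
    return text.lower().split()


def suggest(suggestions, user_input):
    """Return suggestions where every query token is a prefix of some indexed token."""
    query = query_analyzer(user_input)
    if not query:
        return []
    return [s for s in suggestions
            if all(any(tok.startswith(q) for tok in index_analyzer(s)) for q in query)]


suggestions = [
    "Improve ordering of results",
    "Orphaned segments after merge",
    "Fix typo in README",
]

print(suggest(suggestions, "or"))  # finds "ordering" and "orphaned"
print(index_analyzer("or"))        # the index-time analyzer would drop the term
```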

Which suggester should you use for your application? Impossible to say! You'll have to test each of Lucene's offerings and pick one. Auto-suggest is an area where one-size-does-not-fit-all, so it's great that Lucene is picking up a number of competing implementations. Whichever you use, please give us feedback so we can further iterate and improve!

51 comments:

Unknown, June 25, 2013 at 1:55 AM
Hello Michael,

Thanks for the contribution. Does it mean that TermQuery significantly outperforms PrefixQuery? I always thought that they perform similarly because both are backed by TermEnum.seek().

Thanks

Michael McCandless, June 25, 2013 at 8:15 AM
Hi Mikhail,

A TermQuery is usually much faster than a PrefixQuery, since it decodes a single docID list, whereas PrefixQuery must decode N lists and take their "union", often with the same doc appearing in many lists.

Unknown, June 25, 2013 at 3:05 PM
That's quite a useful consideration! Thanks. +1

Unknown, July 23, 2013 at 1:35 AM
According to our measurements, this gives about a 20% performance gain.

Opu, July 24, 2013 at 5:54 AM
Hi, I need a small demo implementation of Lucene infix-match suggestion, where the input queries and hits are loaded from a text file. Alternatively, I need help implementing that step by step. What should I do?

Anonymous, July 24, 2013 at 12:37 PM
I am confused about how I build an index that AnalyzingInfixSuggester can use. I've tried using an existing index that I have but lookup() returns no hits. The line "It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time" implies that I have to build a custom index that the suggester can understand but I can't figure out how this is done.

Is there an example of basic usage for AnalyzingInfixSuggester? Thanks!

Anonymous, July 25, 2013 at 8:26 PM
I am having trouble translating this to Lucene 4.4: some classes are missing, some are renamed, and the test did not show how to point AnalyzingInfixSuggester at a real index to rebuild with "payloads" and "weight". Can somebody help?

Michael McCandless, July 27, 2013 at 6:46 AM
You should not build your own index. AnalyzingInfixSuggester builds its own index under the hood when you call the build method.

Anonymous, July 30, 2013 at 1:03 PM
Hi,
This looks like exactly the functionality I have been wanting to implement. I just downloaded and installed Solr 4.4 and tried to set this up; however, I'm not sure how to reference this in the Suggester setup since I don't see an AnalyzingInfixLookupFactory or something similar. Can you point me in the direction I would need to go to reference this in the solrconfig?

Ronald, August 4, 2013 at 11:43 AM
Dear Mike,

My long-outstanding question: using Lucene, how can I index dates and numbers, say credit card numbers or finance transaction journals? These journals have a lot of numbers and dates. I have yet to crack this and would appreciate your help or suggestions.

Regards,
Ronald

Unknown, October 13, 2013 at 8:26 PM
Hi Mike.

Very interesting article.

You mentioned suggestion lookup in a Lucene index.

I would have thought that the AnalyzingInfixSuggester keeps an internal data structure in RAM, built out of the Lucene index and used for fast lookup?

I guess a direct lookup in a Lucene index would be much slower.

Unknown, November 29, 2013 at 6:17 AM
Hi Michael,

I'm working on a custom suggester derived from this AnalyzingInfix. I need to add what you called a "blended score" (//TODO ln.399) to transform the weight depending on the position of the term(s) in the text.
I've tried two ways so far:
- creating a coefficient based on the term position in the text (using TermVector and DocsAndPositionsEnum)
- adding a SpanQuery when searching to get a score that I can multiply with the weight.

Any other suggestions?

I would be happy to add these changes to Lucene, so do you think it's worth creating a feature ticket in Jira?
Cheers and thanks for your work!

Aashish, May 3, 2014 at 6:59 AM
Hi,

Is there any way to prioritize prefix matches over infix matches (irrespective of the weight)?

Any help would be appreciated. Thanks!

RaCo, July 17, 2014 at 4:45 PM
Hi Michael,

Do you have any idea if it's possible to do filtering with this suggester module?

I need to filter the suggestions by their IDs. Something like "show me only the suggestions for cities with the state id = 14".

Thanks!

Unknown, September 7, 2014 at 6:25 AM
Hi Michael,

Thanks for the article. I have implemented infix lookup with AnalyzingInfixLookupFactory, but I am getting duplicate results in the suggester. Do you know how to avoid duplicate results here?

mschipperheyn, February 4, 2015 at 12:22 PM
Would really appreciate some code examples instead of just end-user examples!

Srikrishna, March 23, 2015 at 6:35 AM
Hi Michael,
I currently index log files where each line is a Document with fields (Time, Contents, Filename), using a whitespace analyzer plus a lowercase filter. I use a single Document instance for indexing, changing the field values each time. Where exactly should the suggester's build method be called: during indexing or during search?