Suggest, sometimes called auto-suggest, type-ahead search or
auto-complete, is now an essential search feature ever since Google
added it
almost
5
years ago.
Lucene has a number of implementations;
I
previously
described AnalyzingSuggester
. Since
then,
FuzzySuggester
was also added, which extends
AnalyzingSuggester
by also accepting mis-spelled inputs.
Here I describe our newest
suggester:
AnalyzingInfixSuggester
, now going through
iterations on
the
LUCENE-4845
Jira issue.
Unlike the existing suggesters, which generally find suggestions whose
whole prefix matches the current user input, this suggester will find
matches of tokens anywhere in the user input and in the suggestion;
this is why it has
Infix in its name.
You can see it in action
at
the example Jira
search application
that
I
built to showcase various Lucene features.
For example, if you
enter
japan
you should see various issues suggested,
including:
- SOLR-4945: Japanese Autocomplete and Highlighter broken
- LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji
- LUCENE-3921: Add decompose compound Japanese Katakana
token capability to Kuromoji
As you can see, the incoming characters can match not just the prefix
of each suggestion but also the prefix of any token within.
Unlike the existing suggesters, this new suggester does not use a
specialized data-structure such
as
FSTs.
Instead, it's an "ordinary" Lucene index under-the-hood, making use
of
EdgeNGramTokenFilter
to index the short prefixes of
each token, up to length 3 by default, for fast prefix querying.
It also uses the
new
index
sorter APIs to pre-sort all postings by suggested weight at index
time, and at lookup time uses a
custom
Collector
to stop after finding the first N
matching hits since these hits are the best matches when sorting by
weight. The lookup method lets you specify whether all terms must be
found, or any of the terms
(
Jira search
requires all terms).
Since the suggestions are sorted solely by weight, and no other
relevance criteria, this suggester is a good fit for applications that
have a strong a-priori weighting for each suggestion, such as a movie
search engine ranking suggestions by popularity, recency or a blend, for
each movie.
In
Jira search I
rank each suggestion (Jira issue) by how recently it was updated.
Specifically, there is no penalty for suggestions with matching tokens
far from the beginning, which could mean the relevance is poor in some
cases; an alternative approach (patch is on the issue) uses FSTs
instead, which can require that the matched tokens are within the
first three tokens, for example. This would also be possible
with
AnalyzingInfixSuggester
using an index-time analyzer
that dropped all but the first three tokens.
One nice benefit of an index-based approach
is
AnalyzingInfixSuggester
handles highlighting of the
matched tokens (red color, above),
which
has
unfortunately proven difficult to provide with the FST-based
suggesters. Another benefit is, in theory, the suggester could
support near-real-time indexing, but I haven't exposed that in the
current patch and probably won't for some time (patches welcome!).
Performance is reasonable: somewhere
between
AnalyzingSuggester
and
FuzzySuggester
, between 58 - 100 kQPS (details
on
the issue).
Analysis fun
As
with
AnalyzingSuggester
,
AnalyzingInfixSuggester
let's you separately configure the index-time vs. search-time
analyzers.
With
Jira search, I
enabled stop-word removal at index time, but not at
search time, so that a query like
or
would still
successfully find any suggestions containing words starting
with
or
, rather than dropping the term entirely.
Which suggester should you use for your application? Impossible to
say! You'll have to test each of Lucene's offerings and pick one.
Auto-suggest is an area where one-size-does-not-fit-all, so it's great
that Lucene is picking up a number of competing implementations.
Whichever you use,
please
give us
feedback so we can further iterate and improve!