Lucene has a number of implementations; I previously described
AnalyzingSuggester
. Since
then, FuzzySuggester
was also added, which extends
AnalyzingSuggester
by also accepting mis-spelled inputs.
Here I describe our newest suggester:
AnalyzingInfixSuggester
, now going through
iterations on
the LUCENE-4845
Jira issue.
Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.
You can see it in action at the example Jira search application that I built to showcase various Lucene features.
For example, if you enter
japan
you should see various issues suggested,
including:
- SOLR-4945: Japanese Autocomplete and Highlighter broken
- LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji
- LUCENE-3921: Add decompose compound Japanese Katakana token capability to Kuromoji
Unlike the existing suggesters, this new suggester does not use a specialized data-structure such as FSTs. Instead, it's an "ordinary" Lucene index under-the-hood, making use of
EdgeNGramTokenFilter
to index the short prefixes of
each token, up to length 3 by default, for fast prefix querying.
It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time, and at lookup time uses a custom
Collector
to stop after finding the first N
matching hits since these hits are the best matches when sorting by
weight. The lookup method lets you specify whether all terms must be
found, or any of the terms
(Jira search
requires all terms).
Since the suggestions are sorted solely by weight, and no other relevance criteria, this suggester is a good fit for applications that have a strong a-priori weighting for each suggestion, such as a movie search engine ranking suggestions by popularity, recency or a blend, for each movie. In Jira search I rank each suggestion (Jira issue) by how recently it was updated.
Specifically, there is no penalty for suggestions with matching tokens far from the beginning, which could mean the relevance is poor in some cases; an alternative approach (patch is on the issue) uses FSTs instead, which can require that the matched tokens are within the first three tokens, for example. This would also be possible with
AnalyzingInfixSuggester
using an index-time analyzer
that dropped all but the first three tokens.
One nice benefit of an index-based approach is
AnalyzingInfixSuggester
handles highlighting of the
matched tokens (red color, above),
which has
unfortunately proven difficult to provide with the FST-based
suggesters. Another benefit is, in theory, the suggester could
support near-real-time indexing, but I haven't exposed that in the
current patch and probably won't for some time (patches welcome!).
Performance is reasonable: somewhere between
AnalyzingSuggester
and FuzzySuggester
, between 58 - 100 kQPS (details
on the issue).
Analysis fun
As with
AnalyzingSuggester
, AnalyzingInfixSuggester
let's you separately configure the index-time vs. search-time
analyzers.
With Jira search, I
enabled stop-word removal at index time, but not at
search time, so that a query like or
would still
successfully find any suggestions containing words starting
with or
, rather than dropping the term entirely.
Which suggester should you use for your application? Impossible to say! You'll have to test each of Lucene's offerings and pick one. Auto-suggest is an area where one-size-does-not-fit-all, so it's great that Lucene is picking up a number of competing implementations. Whichever you use, please give us feedback so we can further iterate and improve!
Hello Michael,
ReplyDeleteThanks for the contribution. Does it mean that TermQuery significantly outperforms PrefixQuery? I always though that they perform similarly because both are backed on TermEnum.seek()
Thanks
Hi Mikhail,
ReplyDeleteA TermQuery is usually much faster than a PrefixQuery, since it decodes a single docID list vs PrefixQuery which must decode N and do a "union" often with the same doc appearing in many lists.
That's a quite useful consideration! Thanks. +1
ReplyDeletein according to our measurements this hit gives about 20% performance gain.
ReplyDeleteSorry, what's 20% faster? TermQuery vs PrefixQuery?
Deleteflipping to TermQuery on EdgeNGrams in case of short string for suggestiing gives 20% performance gain in comparison to using PrefixQuery always.
Deletesorry for the late reply
Ahh, I see.
DeleteI'm surprised it's only 20% gain!
I'm surprised that it's so much! I see that it can win if the same prefix occurs more than once in the same string e.g. WOmbat WOmen bag. I've thought that we don't have many such cases.
DeleteHi, I need a small demo source code implementation Lucene Infix match suggestion where input query and hits will load from a text file. Or I need a help to implement that in step by step process. What should I do.
ReplyDeleteHow about starting from its unit test? https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_4/lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.java
Deleteimport org.apache.lucene.search.spell.TermFreqPayloadIterator was not recognized; what maven dependency should I include? I have included a dependency for Spell Checker.
DeleteI am confused about how I build an index that AnalyzingInfixSuggester can use. I've tried using an existing index that I have but lookup() returns no hits. The line "It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time" implies that I have to build a custom index that the suggester can understand but I can't figure out how this is done.
ReplyDeleteIs there an example of basic usage for AnalyzingInfixSuggestr? Thanks!
Hi Anonymous,
DeleteYou don't build the index; that happens "under the hood". You just use the APIs on AnalyzingInfixSuggester to build ...
See the unit test in my comment above as an example.
Oops, that's what I get for not refreshing the page before commenting.
DeleteI assume you meant to post this link: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_4/lucene/suggest/src/test/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggesterTest.java ? The other one goes to the source.
Thanks, it looks like a good starting point.
Woops, indeed I did! Thanks for posting the correct link!
DeleteI am having trouble translating this to lucene 4.4 some classes are missing some are renamed and the test did not show how to point real index that AnalyzingInfixSuggester to rebuild with "payloads" and "weight". Can somebody help?
ReplyDeleteHi Anonymous,
DeleteThe above link is to the 4.4 unit test, so it should compile fine against 4.4.
Could you send your question to the java-user@lucene.apache.org lists, and include specifics about what you tried to do / what failed / etc? Thanks.
I have build the tests but when point analyzer to my index it rewrites all data except the "text" "textgrams" "payloads" and "weight" and acquire lock on the current file.
DeleteYou should not build your own index. AnalyzingInfixSuggester builds its own index, under the hood when you call the build method.
ReplyDeleteYes saw it trough source. One last stupid question, how to tell the suggester to treat new readed document(since i am iterating over my original index with doc.get"content" and doc.get"id" and next id) as separate document since every read its starts from clean index(dont add terms from the last document).
DeleteUnfortunately the suggesters don't support "incremental building": you have to fully rebuild the suggest index every time.
DeleteThat said, since AnalyzingInfixSuggester is just an index under-the-hood, it could in theory easily support adding to an existing suggest index ... it's just that nobody has implemented it yet (patches welcome!).
So, if understand right for every query it has to build the index again? how will find matching documents if it can read only one document at a time.
DeleteNo, it doesn't rebuild the index for every query. It rebuilds the index whenever you want to change the suggestions (ie, whenever you call the .build method).
DeleteThe lookup method find the best suggestions of all the ones you indexed with the build method.
Posted reply with source if this can help.
Deletebuild it. Mike thanks for the help. Need to sleep now :)
DeleteHi,
ReplyDeleteThis looks like it is exactly the functionality I have been wanting to implement. I just downloaded and installed solr 4.4 and tried to set this up, however I'm not sure how to reference this in the Suggester setup since I don't see an AnalyzingInfixLookupFactory or something similar. Can you point me in the direction I would need to go to reference this in the solrconfig?
Alas, there is no factory for Solr yet, so at the current time you can't use this suggester from Solr (patches welcome!).
DeleteDear Mike,
ReplyDeleteMy long outstanding question and doubt, using lucene how can i index date and numbers,, say credit card numbers,,, finance transaction journals... these journals have lot of numbers,, and date.Am still to crack, apprecaite your help / suggestion.
Regards,
Ronald
Hi Ronald,
DeleteSearching on dates and numbers is not a problem; you can use NumericField for a numeric or date field if you want to do range filtering, or just index numbers as individual tokens as part of the text (make sure you use an analyzer that preserves numbers).
You should send mail to the Lucene user's list (java-user@lucene.apache.org) with specific questions ...
Thanks a ton Mike, i ll write an email.
DeleteRegards,
Ronald
Hello. In order to use AnalyzingInfixSuggester do I need to invoke build() method also. I created index with normal code. Passed the location of this index to AnalyzingInfixSuggester. The lookup list is always 0. The analyzer used is StandardAnalyzer.
DeleteWhere I might be going wrong.
Code snippet:
Analyzer analyzer=new StandardAnalyzer(Version.LUCENE_44);
AnalyzingInfixSuggester ainfixSuggester=new AnalyzingInfixSuggester(Version.LUCENE_44, new File("D:\\InfixIndexes"),analyzer);
You should not build your own index ahead of time.
DeleteInstead, you call the build() method, which under the hood builds its own index. You feed it a TermFreqIterator (or TermFreqPayloadIterator) that enumerates all of your suggestions + weights.
After the build is done, then you use the lookup method to find suggestions.
Hi Mike.
ReplyDeleteVery interesting article.
You mentioned suggestion lookup in lucene index.
I would have thought that the AnalyzingInfixSuggester keeps an internal data structure in RAM built out of the lucene index and used for fast lookup?
I guess a direct lookup in a lucene index would be much slower.
Hi Aracdius,
DeleteCurrently, AnalyzingInfixSuggester just does 'ordinary' lucene searches against the index, but because the index is impact-sorted, it stops collecting hits as soon as it has the topN results, which makes it very fast. Also, as long as the OS has enough RAM, it will cache the hot parts of the index in RAM anyway ...
Hi Michael,
ReplyDeleteI'm working on a custom suggester derived from this AnalyzingInfix. I need to add what you called a "blended score" (//TODO ln.399) to transform the weight depending on the position of the term(s) in the text.
I've tried two ways right now :
- creating a coefficient based on the term position in the text (using TermVector and DocsAndPositionsEnum)
- adding a SpanQuery when searching to get a score that I can multiply with the weight.
Any other suggestions ?
I would be happy to add these changes in Lucene, so do you think it's worth creating a feature ticket in Jira ?
Cheers and thanks for your work!
Hi Remi,
DeleteI think you should create a Jira issue so we can add this enhancement to Lucene? This would make a nice improvement.
One challenge is the "early termination" that the suggester does now, since the index is pre-sorted by "impact" on the assumption that we'll only sort by the a-priori weight. We'd need to somehow relax that if we are in fact sorting by a different factor, unless that factor could also be pre-compiled into the index.
Here it is :
Deletehttps://issues.apache.org/jira/browse/LUCENE-5354
Hi,
ReplyDeleteIs there anyway to prioritize prefix matches over infix matches (irrespective of the weight)?
Any help would be appreciated. Thanks!
Try BlendedInfixSuggester? It lets you score higher when the hit is in terms closer to the start of the suggestion ... not exactly what you're asking for but maybe it works?
DeleteThanks for your reply. Any idea why the spellcheck.count setting does not work with BlendedInfixSuggester? Seems to give 100 results always.
DeleteSorry, I just realized that it returns 10x the number of requested results by default. I guess it re-orders the results based on how far the matched word is from the start.
DeleteHi Michael,
ReplyDeleteDo you have any idea if it's possible to do filtering with this suggester module?
I need to filter the suggestions by their id's. Something like "show me only the suggestiongs for cities with the state id = 14".
Thanks!
Hi RaCo,
DeleteFiltering is not easy with Lucene's current suggesters, but it would be simple to add to AnalyzingInfixSuggester since it's "just" a Lucene search under-the-hood. Maybe ask on the user's list (java-user@lucene.apache.org)?
This comment has been removed by the author.
ReplyDeleteHi Michael,
ReplyDeleteThanks for the article. I have implemented Infix lookup with AnalyzingInfixLookupFactory but I am getting duplicate results in suggester. Do you know to avoid duplicate results here?
Hi, it's best to ask on the Lucene users' list? (java-user@lucene.apache.org)
DeleteThere is a ticket about this with no resolution yet: https://issues.apache.org/jira/browse/LUCENE-6336
DeleteWould really appreciate some code examples in stead of just end user examples!
ReplyDeleteHave a look at Lucene's unit tests? They are a simple way to see how to use the APIs...
DeleteHi Micheal
ReplyDeleteI currently index log files where each line is a Document with fields ( Time,Contents,Filename) using whitespace Analyzer + lowercasefilter . I use a single instance of Document for indexing by changing the field values. Where exactly should i have the Suggesters build method called is it during indexing or during search?
Hi Srikrishna,
DeleteCan you ask on Lucene's user list? java-user@lucene.apache.org