Lucene has a number of implementations; I previously described
AnalyzingSuggester. Since then,
FuzzySuggesterwas also added, which extends
AnalyzingSuggesterby also accepting mis-spelled inputs.
Here I describe our newest suggester:
AnalyzingInfixSuggester, now going through iterations on the LUCENE-4845 Jira issue.
Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.
You can see it in action at the example Jira search application that I built to showcase various Lucene features.
For example, if you enter
japanyou should see various issues suggested, including:
- SOLR-4945: Japanese Autocomplete and Highlighter broken
- LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji
- LUCENE-3921: Add decompose compound Japanese Katakana token capability to Kuromoji
Unlike the existing suggesters, this new suggester does not use a specialized data-structure such as FSTs. Instead, it's an "ordinary" Lucene index under-the-hood, making use of
EdgeNGramTokenFilterto index the short prefixes of each token, up to length 3 by default, for fast prefix querying.
It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time, and at lookup time uses a custom
Collectorto stop after finding the first N matching hits since these hits are the best matches when sorting by weight. The lookup method lets you specify whether all terms must be found, or any of the terms (Jira search requires all terms).
Since the suggestions are sorted solely by weight, and no other relevance criteria, this suggester is a good fit for applications that have a strong a-priori weighting for each suggestion, such as a movie search engine ranking suggestions by popularity, recency or a blend, for each movie. In Jira search I rank each suggestion (Jira issue) by how recently it was updated.
Specifically, there is no penalty for suggestions with matching tokens far from the beginning, which could mean the relevance is poor in some cases; an alternative approach (patch is on the issue) uses FSTs instead, which can require that the matched tokens are within the first three tokens, for example. This would also be possible with
AnalyzingInfixSuggesterusing an index-time analyzer that dropped all but the first three tokens.
One nice benefit of an index-based approach is
AnalyzingInfixSuggesterhandles highlighting of the matched tokens (red color, above), which has unfortunately proven difficult to provide with the FST-based suggesters. Another benefit is, in theory, the suggester could support near-real-time indexing, but I haven't exposed that in the current patch and probably won't for some time (patches welcome!).
Performance is reasonable: somewhere between
FuzzySuggester, between 58 - 100 kQPS (details on the issue).
AnalyzingInfixSuggesterlet's you separately configure the index-time vs. search-time analyzers. With Jira search, I enabled stop-word removal at index time, but not at search time, so that a query like
orwould still successfully find any suggestions containing words starting with
or, rather than dropping the term entirely.
Which suggester should you use for your application? Impossible to say! You'll have to test each of Lucene's offerings and pick one. Auto-suggest is an area where one-size-does-not-fit-all, so it's great that Lucene is picking up a number of competing implementations. Whichever you use, please give us feedback so we can further iterate and improve!
Thanks for the contribution. Does it mean that TermQuery significantly outperforms PrefixQuery? I always though that they perform similarly because both are backed on TermEnum.seek()
A TermQuery is usually much faster than a PrefixQuery, since it decodes a single docID list vs PrefixQuery which must decode N and do a "union" often with the same doc appearing in many lists.
That's a quite useful consideration! Thanks. +1ReplyDelete
in according to our measurements this hit gives about 20% performance gain.ReplyDelete
Sorry, what's 20% faster? TermQuery vs PrefixQuery?Delete
flipping to TermQuery on EdgeNGrams in case of short string for suggestiing gives 20% performance gain in comparison to using PrefixQuery always.Delete
sorry for the late reply
Ahh, I see.Delete
I'm surprised it's only 20% gain!
I'm surprised that it's so much! I see that it can win if the same prefix occurs more than once in the same string e.g. WOmbat WOmen bag. I've thought that we don't have many such cases.Delete
Hi, I need a small demo source code implementation Lucene Infix match suggestion where input query and hits will load from a text file. Or I need a help to implement that in step by step process. What should I do.ReplyDelete
How about starting from its unit test? https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_4/lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.javaDelete
import org.apache.lucene.search.spell.TermFreqPayloadIterator was not recognized; what maven dependency should I include? I have included a dependency for Spell Checker.Delete
I am confused about how I build an index that AnalyzingInfixSuggester can use. I've tried using an existing index that I have but lookup() returns no hits. The line "It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time" implies that I have to build a custom index that the suggester can understand but I can't figure out how this is done.ReplyDelete
Is there an example of basic usage for AnalyzingInfixSuggestr? Thanks!
You don't build the index; that happens "under the hood". You just use the APIs on AnalyzingInfixSuggester to build ...
See the unit test in my comment above as an example.
Oops, that's what I get for not refreshing the page before commenting.Delete
I assume you meant to post this link: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_4/lucene/suggest/src/test/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggesterTest.java ? The other one goes to the source.
Thanks, it looks like a good starting point.
Woops, indeed I did! Thanks for posting the correct link!Delete
I am having trouble translating this to lucene 4.4 some classes are missing some are renamed and the test did not show how to point real index that AnalyzingInfixSuggester to rebuild with "payloads" and "weight". Can somebody help?ReplyDelete
The above link is to the 4.4 unit test, so it should compile fine against 4.4.
Could you send your question to the firstname.lastname@example.org lists, and include specifics about what you tried to do / what failed / etc? Thanks.
I have build the tests but when point analyzer to my index it rewrites all data except the "text" "textgrams" "payloads" and "weight" and acquire lock on the current file.Delete
You should not build your own index. AnalyzingInfixSuggester builds its own index, under the hood when you call the build method.ReplyDelete
Yes saw it trough source. One last stupid question, how to tell the suggester to treat new readed document(since i am iterating over my original index with doc.get"content" and doc.get"id" and next id) as separate document since every read its starts from clean index(dont add terms from the last document).Delete
Unfortunately the suggesters don't support "incremental building": you have to fully rebuild the suggest index every time.Delete
That said, since AnalyzingInfixSuggester is just an index under-the-hood, it could in theory easily support adding to an existing suggest index ... it's just that nobody has implemented it yet (patches welcome!).
So, if understand right for every query it has to build the index again? how will find matching documents if it can read only one document at a time.Delete
No, it doesn't rebuild the index for every query. It rebuilds the index whenever you want to change the suggestions (ie, whenever you call the .build method).Delete
The lookup method find the best suggestions of all the ones you indexed with the build method.
Posted reply with source if this can help.Delete
build it. Mike thanks for the help. Need to sleep now :)Delete
This looks like it is exactly the functionality I have been wanting to implement. I just downloaded and installed solr 4.4 and tried to set this up, however I'm not sure how to reference this in the Suggester setup since I don't see an AnalyzingInfixLookupFactory or something similar. Can you point me in the direction I would need to go to reference this in the solrconfig?
Alas, there is no factory for Solr yet, so at the current time you can't use this suggester from Solr (patches welcome!).Delete
My long outstanding question and doubt, using lucene how can i index date and numbers,, say credit card numbers,,, finance transaction journals... these journals have lot of numbers,, and date.Am still to crack, apprecaite your help / suggestion.
Searching on dates and numbers is not a problem; you can use NumericField for a numeric or date field if you want to do range filtering, or just index numbers as individual tokens as part of the text (make sure you use an analyzer that preserves numbers).
You should send mail to the Lucene user's list (email@example.com) with specific questions ...
Thanks a ton Mike, i ll write an email.Delete
Hello. In order to use AnalyzingInfixSuggester do I need to invoke build() method also. I created index with normal code. Passed the location of this index to AnalyzingInfixSuggester. The lookup list is always 0. The analyzer used is StandardAnalyzer.Delete
Where I might be going wrong.
Analyzer analyzer=new StandardAnalyzer(Version.LUCENE_44);
AnalyzingInfixSuggester ainfixSuggester=new AnalyzingInfixSuggester(Version.LUCENE_44, new File("D:\\InfixIndexes"),analyzer);
You should not build your own index ahead of time.Delete
Instead, you call the build() method, which under the hood builds its own index. You feed it a TermFreqIterator (or TermFreqPayloadIterator) that enumerates all of your suggestions + weights.
After the build is done, then you use the lookup method to find suggestions.
Very interesting article.
You mentioned suggestion lookup in lucene index.
I would have thought that the AnalyzingInfixSuggester keeps an internal data structure in RAM built out of the lucene index and used for fast lookup?
I guess a direct lookup in a lucene index would be much slower.
Currently, AnalyzingInfixSuggester just does 'ordinary' lucene searches against the index, but because the index is impact-sorted, it stops collecting hits as soon as it has the topN results, which makes it very fast. Also, as long as the OS has enough RAM, it will cache the hot parts of the index in RAM anyway ...
I'm working on a custom suggester derived from this AnalyzingInfix. I need to add what you called a "blended score" (//TODO ln.399) to transform the weight depending on the position of the term(s) in the text.
I've tried two ways right now :
- creating a coefficient based on the term position in the text (using TermVector and DocsAndPositionsEnum)
- adding a SpanQuery when searching to get a score that I can multiply with the weight.
Any other suggestions ?
I would be happy to add these changes in Lucene, so do you think it's worth creating a feature ticket in Jira ?
Cheers and thanks for your work!
I think you should create a Jira issue so we can add this enhancement to Lucene? This would make a nice improvement.
One challenge is the "early termination" that the suggester does now, since the index is pre-sorted by "impact" on the assumption that we'll only sort by the a-priori weight. We'd need to somehow relax that if we are in fact sorting by a different factor, unless that factor could also be pre-compiled into the index.
Here it is :Delete
Is there anyway to prioritize prefix matches over infix matches (irrespective of the weight)?
Any help would be appreciated. Thanks!
Try BlendedInfixSuggester? It lets you score higher when the hit is in terms closer to the start of the suggestion ... not exactly what you're asking for but maybe it works?Delete
Thanks for your reply. Any idea why the spellcheck.count setting does not work with BlendedInfixSuggester? Seems to give 100 results always.Delete
Sorry, I just realized that it returns 10x the number of requested results by default. I guess it re-orders the results based on how far the matched word is from the start.Delete
Do you have any idea if it's possible to do filtering with this suggester module?
I need to filter the suggestions by their id's. Something like "show me only the suggestiongs for cities with the state id = 14".
Filtering is not easy with Lucene's current suggesters, but it would be simple to add to AnalyzingInfixSuggester since it's "just" a Lucene search under-the-hood. Maybe ask on the user's list (firstname.lastname@example.org)?
This comment has been removed by the author.ReplyDelete
Thanks for the article. I have implemented Infix lookup with AnalyzingInfixLookupFactory but I am getting duplicate results in suggester. Do you know to avoid duplicate results here?
Hi, it's best to ask on the Lucene users' list? (email@example.com)Delete
There is a ticket about this with no resolution yet: https://issues.apache.org/jira/browse/LUCENE-6336Delete
Would really appreciate some code examples in stead of just end user examples!ReplyDelete
Have a look at Lucene's unit tests? They are a simple way to see how to use the APIs...Delete
I currently index log files where each line is a Document with fields ( Time,Contents,Filename) using whitespace Analyzer + lowercasefilter . I use a single instance of Document for indexing by changing the field values. Where exactly should i have the Suggesters build method called is it during indexing or during search?
Can you ask on Lucene's user list? firstname.lastname@example.org