Saturday, June 22, 2013

A new Lucene suggester based on infix matches

Suggest, sometimes called auto-suggest, type-ahead search or auto-complete, is now an essential search feature ever since Google added it almost 5 years ago.

Lucene has a number of implementations; I previously described AnalyzingSuggester. Since then, FuzzySuggester was also added, which extends AnalyzingSuggester by also accepting mis-spelled inputs.

Here I describe our newest suggester: AnalyzingInfixSuggester, now going through iterations on the LUCENE-4845 Jira issue.

Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.

You can see it in action at the example Jira search application that I built to showcase various Lucene features.

For example, if you enter japan you should see various issues suggested, including:
  • SOLR-4945: Japanese Autocomplete and Highlighter broken
  • LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji
  • LUCENE-3921: Add decompose compound Japanese Katakana token capability to Kuromoji
As you can see, the incoming characters can match not just the prefix of each suggestion but also the prefix of any token within.

Unlike the existing suggesters, this new suggester does not use a specialized data-structure such as FSTs. Instead, it's an "ordinary" Lucene index under-the-hood, making use of EdgeNGramTokenFilter to index the short prefixes of each token, up to length 3 by default, for fast prefix querying.

It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time, and at lookup time uses a custom Collector to stop after finding the first N matching hits since these hits are the best matches when sorting by weight. The lookup method lets you specify whether all terms must be found, or any of the terms (Jira search requires all terms).

Since the suggestions are sorted solely by weight, and no other relevance criteria, this suggester is a good fit for applications that have a strong a-priori weighting for each suggestion, such as a movie search engine ranking suggestions by popularity, recency or a blend, for each movie. In Jira search I rank each suggestion (Jira issue) by how recently it was updated.

Specifically, there is no penalty for suggestions with matching tokens far from the beginning, which could mean the relevance is poor in some cases; an alternative approach (patch is on the issue) uses FSTs instead, which can require that the matched tokens are within the first three tokens, for example. This would also be possible with AnalyzingInfixSuggester using an index-time analyzer that dropped all but the first three tokens.

One nice benefit of an index-based approach is AnalyzingInfixSuggester handles highlighting of the matched tokens (red color, above), which has unfortunately proven difficult to provide with the FST-based suggesters. Another benefit is, in theory, the suggester could support near-real-time indexing, but I haven't exposed that in the current patch and probably won't for some time (patches welcome!).

Performance is reasonable: somewhere between AnalyzingSuggester and FuzzySuggester, between 58 - 100 kQPS (details on the issue).

Analysis fun

As with AnalyzingSuggester, AnalyzingInfixSuggester let's you separately configure the index-time vs. search-time analyzers. With Jira search, I enabled stop-word removal at index time, but not at search time, so that a query like or would still successfully find any suggestions containing words starting with or, rather than dropping the term entirely.

Which suggester should you use for your application? Impossible to say! You'll have to test each of Lucene's offerings and pick one. Auto-suggest is an area where one-size-does-not-fit-all, so it's great that Lucene is picking up a number of competing implementations. Whichever you use, please give us feedback so we can further iterate and improve!

42 comments:

  1. Hello Michael,

    Thanks for the contribution. Does it mean that TermQuery significantly outperforms PrefixQuery? I always though that they perform similarly because both are backed on TermEnum.seek()

    Thanks

    ReplyDelete
  2. Hi Mikhail,

    A TermQuery is usually much faster than a PrefixQuery, since it decodes a single docID list vs PrefixQuery which must decode N and do a "union" often with the same doc appearing in many lists.

    ReplyDelete
  3. That's a quite useful consideration! Thanks. +1

    ReplyDelete
  4. in according to our measurements this hit gives about 20% performance gain.

    ReplyDelete
    Replies
    1. Sorry, what's 20% faster? TermQuery vs PrefixQuery?

      Delete
    2. flipping to TermQuery on EdgeNGrams in case of short string for suggestiing gives 20% performance gain in comparison to using PrefixQuery always.

      sorry for the late reply

      Delete
    3. Ahh, I see.

      I'm surprised it's only 20% gain!

      Delete
    4. I'm surprised that it's so much! I see that it can win if the same prefix occurs more than once in the same string e.g. WOmbat WOmen bag. I've thought that we don't have many such cases.

      Delete
  5. Hi, I need a small demo source code implementation Lucene Infix match suggestion where input query and hits will load from a text file. Or I need a help to implement that in step by step process. What should I do.

    ReplyDelete
    Replies
    1. How about starting from its unit test? https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_4/lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.java

      Delete
    2. import org.apache.lucene.search.spell.TermFreqPayloadIterator was not recognized; what maven dependency should I include? I have included a dependency for Spell Checker.

      Delete
  6. I am confused about how I build an index that AnalyzingInfixSuggester can use. I've tried using an existing index that I have but lookup() returns no hits. The line "It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time" implies that I have to build a custom index that the suggester can understand but I can't figure out how this is done.

    Is there an example of basic usage for AnalyzingInfixSuggestr? Thanks!

    ReplyDelete
    Replies
    1. Hi Anonymous,

      You don't build the index; that happens "under the hood". You just use the APIs on AnalyzingInfixSuggester to build ...

      See the unit test in my comment above as an example.

      Delete
    2. Oops, that's what I get for not refreshing the page before commenting.

      I assume you meant to post this link: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_4/lucene/suggest/src/test/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggesterTest.java ? The other one goes to the source.

      Thanks, it looks like a good starting point.

      Delete
    3. Woops, indeed I did! Thanks for posting the correct link!

      Delete
  7. I am having trouble translating this to lucene 4.4 some classes are missing some are renamed and the test did not show how to point real index that AnalyzingInfixSuggester to rebuild with "payloads" and "weight". Can somebody help?

    ReplyDelete
    Replies
    1. Hi Anonymous,

      The above link is to the 4.4 unit test, so it should compile fine against 4.4.

      Could you send your question to the java-user@lucene.apache.org lists, and include specifics about what you tried to do / what failed / etc? Thanks.

      Delete
    2. I have build the tests but when point analyzer to my index it rewrites all data except the "text" "textgrams" "payloads" and "weight" and acquire lock on the current file.

      Delete
  8. You should not build your own index. AnalyzingInfixSuggester builds its own index, under the hood when you call the build method.

    ReplyDelete
    Replies
    1. Yes saw it trough source. One last stupid question, how to tell the suggester to treat new readed document(since i am iterating over my original index with doc.get"content" and doc.get"id" and next id) as separate document since every read its starts from clean index(dont add terms from the last document).

      Delete
    2. Unfortunately the suggesters don't support "incremental building": you have to fully rebuild the suggest index every time.

      That said, since AnalyzingInfixSuggester is just an index under-the-hood, it could in theory easily support adding to an existing suggest index ... it's just that nobody has implemented it yet (patches welcome!).

      Delete
    3. So, if understand right for every query it has to build the index again? how will find matching documents if it can read only one document at a time.

      Delete
    4. No, it doesn't rebuild the index for every query. It rebuilds the index whenever you want to change the suggestions (ie, whenever you call the .build method).

      The lookup method find the best suggestions of all the ones you indexed with the build method.

      Delete
    5. Posted reply with source if this can help.

      Delete
    6. build it. Mike thanks for the help. Need to sleep now :)

      Delete
  9. Hi,
    This looks like it is exactly the functionality I have been wanting to implement. I just downloaded and installed solr 4.4 and tried to set this up, however I'm not sure how to reference this in the Suggester setup since I don't see an AnalyzingInfixLookupFactory or something similar. Can you point me in the direction I would need to go to reference this in the solrconfig?

    ReplyDelete
    Replies
    1. Alas, there is no factory for Solr yet, so at the current time you can't use this suggester from Solr (patches welcome!).

      Delete
  10. Dear Mike,

    My long outstanding question and doubt, using lucene how can i index date and numbers,, say credit card numbers,,, finance transaction journals... these journals have lot of numbers,, and date.Am still to crack, apprecaite your help / suggestion.

    Regards,
    Ronald

    ReplyDelete
    Replies
    1. Hi Ronald,

      Searching on dates and numbers is not a problem; you can use NumericField for a numeric or date field if you want to do range filtering, or just index numbers as individual tokens as part of the text (make sure you use an analyzer that preserves numbers).

      You should send mail to the Lucene user's list (java-user@lucene.apache.org) with specific questions ...

      Delete
    2. Thanks a ton Mike, i ll write an email.


      Regards,
      Ronald

      Delete
    3. Hello. In order to use AnalyzingInfixSuggester do I need to invoke build() method also. I created index with normal code. Passed the location of this index to AnalyzingInfixSuggester. The lookup list is always 0. The analyzer used is StandardAnalyzer.

      Where I might be going wrong.

      Code snippet:

      Analyzer analyzer=new StandardAnalyzer(Version.LUCENE_44);


      AnalyzingInfixSuggester ainfixSuggester=new AnalyzingInfixSuggester(Version.LUCENE_44, new File("D:\\InfixIndexes"),analyzer);

      Delete
    4. You should not build your own index ahead of time.

      Instead, you call the build() method, which under the hood builds its own index. You feed it a TermFreqIterator (or TermFreqPayloadIterator) that enumerates all of your suggestions + weights.

      After the build is done, then you use the lookup method to find suggestions.

      Delete
  11. Hi Mike.

    Very interesting article.

    You mentioned suggestion lookup in lucene index.

    I would have thought that the AnalyzingInfixSuggester keeps an internal data structure in RAM built out of the lucene index and used for fast lookup?

    I guess a direct lookup in a lucene index would be much slower.

    ReplyDelete
    Replies
    1. Hi Aracdius,

      Currently, AnalyzingInfixSuggester just does 'ordinary' lucene searches against the index, but because the index is impact-sorted, it stops collecting hits as soon as it has the topN results, which makes it very fast. Also, as long as the OS has enough RAM, it will cache the hot parts of the index in RAM anyway ...

      Delete
  12. Hi Michael,

    I'm working on a custom suggester derived from this AnalyzingInfix. I need to add what you called a "blended score" (//TODO ln.399) to transform the weight depending on the position of the term(s) in the text.
    I've tried two ways right now :
    - creating a coefficient based on the term position in the text (using TermVector and DocsAndPositionsEnum)
    - adding a SpanQuery when searching to get a score that I can multiply with the weight.

    Any other suggestions ?

    I would be happy to add these changes in Lucene, so do you think it's worth creating a feature ticket in Jira ?
    Cheers and thanks for your work!

    ReplyDelete
    Replies
    1. Hi Remi,

      I think you should create a Jira issue so we can add this enhancement to Lucene? This would make a nice improvement.

      One challenge is the "early termination" that the suggester does now, since the index is pre-sorted by "impact" on the assumption that we'll only sort by the a-priori weight. We'd need to somehow relax that if we are in fact sorting by a different factor, unless that factor could also be pre-compiled into the index.

      Delete
    2. Here it is :
      https://issues.apache.org/jira/browse/LUCENE-5354

      Delete
  13. Hi,

    Is there anyway to prioritize prefix matches over infix matches (irrespective of the weight)?

    Any help would be appreciated. Thanks!

    ReplyDelete
    Replies
    1. Try BlendedInfixSuggester? It lets you score higher when the hit is in terms closer to the start of the suggestion ... not exactly what you're asking for but maybe it works?

      Delete
    2. Thanks for your reply. Any idea why the spellcheck.count setting does not work with BlendedInfixSuggester? Seems to give 100 results always.

      Delete
    3. Sorry, I just realized that it returns 10x the number of requested results by default. I guess it re-orders the results based on how far the matched word is from the start.

      Delete
  14. Hi Michael,

    Do you have any idea if it's possible to do filtering with this suggester module?

    I need to filter the suggestions by their id's. Something like "show me only the suggestiongs for cities with the state id = 14".

    Thanks!

    ReplyDelete