Thursday, March 24, 2011

Lucene's FuzzyQuery is 100 times faster in 4.0

There are many exciting improvements in Lucene's eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only from its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be.

FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.

Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.


The journey begins

The long journey began when Robert Muir had the idea of pre-building a Levenshtein Automaton, a deterministic automaton (DFA) that accepts only the terms within edit distance N. Doing this, up front, and then intersecting that automaton with the terms in the index, should give a massive speedup, he reasoned.

At first he built a simple prototype, explicitly unioning the separate DFAs that allow for up to N insertions, deletions and substitutions. But, unfortunately, just building that DFA (let alone then intersecting it with the terms in the index), was too slow.

Fortunately, after some Googling, he discovered a paper, by Klaus Schulz and Stoyan Mihov (now famous among the Lucene/Solr committers!) detailing an efficient algorithm for building the Levenshtein Automaton from a given base term and max edit distance. All he had to do is code it up! It's just software after all. Somehow, he roped Mark Miller, another Lucene/Solr committer, into helping him do this.

Unfortunately, the paper was nearly unintelligible! It's 67 pages, filled with all sorts of equations, Greek symbols, definitions, propositions, lemmas, proofs. It uses scary concepts like Subsumption Triangles, along with beautiful yet still unintelligible diagrams. Really the paper may as well have been written in Latin.

Much coffee and beer was consumed, sometimes simultaneously. Many hours were spent on IRC, staying up all night, with Mark and Robert carrying on long conversations, which none of the rest of us could understand, trying desperately to decode the paper and turn it into Java code. Weeks went by like this and they actually had made some good initial progress, managing to loosely crack the paper to the point where they had a test implementation of the N=1 case, and it seemed to work. But generalizing that to the N=2 case was... daunting.


The breakthrough

Then, finally, a breakthrough! Robert found, after even more Googling, an existence proof, in an unexpected place: an open-source package, Moman, under the generous MIT license. The author, Jean-Phillipe Barrette-LaPierre, had somehow, incredibly, magically, quietly, implemented the algorithm from this paper. And this was apparently a random side project for him, unrelated to his day job. So now we knew it was possible (and we all have deep admiration for Jean-Phillipe!).

We decided to simply re-use Moman's implementation to accomplish our goals. But, it turns out, its source code is all Python (my favorite programming language)! And, nearly as hairy as the paper itself. Nevertheless, we pushed on.

Not really understanding the Python code, and also neither the paper, we desperately tried to write our own Python code to tap into the various functions embedded in Moman's code, to auto-generate Java code containing the necessary tables for each max edit distance case (N=1, N=2, etc.). We had to guess what each Python function did, by its name, trying to roughly match this up to the spooky terminology in the paper.

The result was createLevAutomata.py: it auto-generates crazy looking Java code (see Lev2ParametricDescription.java, and scroll to the cryptic packed tables at the bottom), which in turn is used by further Java code to create the Levenshtein automaton per-query. We only generate the N=1 and N=2 cases (the N>=3 cases aren't really practical, at least not yet).


The last bug...

Realize, now, what a crazy position we were in. We wrote our own scary Python code, tapping into various functions in the Moman package, to auto-generate unreadable Java code with big tables of numbers, which is then used to generate Levenshtein automata from the base term and N. We went through many iterations with this crazy chain of Python and Java code that we barely understood, slowly iterating to get the bugs out.

After fixing many problems, we still had one persistent bug which we just couldn't understand, let alone fix. We struggled for several days, assuming the bug was in our crazy Python/Java chain. Finally, we considered the possibility that the bug was in Moman, and indeed Robert managed to reduce the problem to a tiny Python-only case showing where Moman failed to match the right terms. Robert sent this example to Jean-Phillipe, who quickly confirmed the bug and posted a patch the next day. We applied his patch and suddenly everything was working perfectly!

Fortunately, while this fast FuzzyQuery was unbelievably hairy to implement, testing it well is relatively easy since we can validate it against the brute-force enumeration from 3.0. We have several tests verifying the different layers executed by the full FuzzyQuery. The tests are exhaustive in that they test all structurally different cases possible in the Levenshtein construction, using a binary (only characters 0 and 1) terms.

Beyond just solving this nearly impossible task of efficiently compiling a term to a Levenshtein Automaton, we had many other parts to fill in. For example, Robert separately created a general AutomatonQuery, re-using infrastructure from the open-source Brics automaton package, to enable fast intersection of an automaton against all terms and documents in the index. This query is now used to handle WildcardQuery, RegexpQuery, and FuzzyQuery. It's also useful for custom cases, too; for example it's used by Solr to reverse wildcard queries. These slides from Robert describe AutomatonQuery, and its fun possible use case, in more detail.

Separately, we had an impedance mismatch: these automatons speak full unicode (UTF32) characters, yet Lucene's terms are stored in UTF8 bytes, so we had to create a UTF32 -> UTF8 automaton converter, which by itself was also very hairy! That converter translates any UTF32 automaton into an equivalent UTF8 Levenshtein automaton, which can be directly intersected against the terms in the index.

So, today, when you run a FuzzyQuery in 4.0, it efficiently seeks and scans only those regions of the term space which may have matches, guided by the Levenshtein automaton. This, coupled with ongoing performance improvements to seeking and scanning terms, as well as other major improvements like performing MultiTermQuery rewrites per-segment, has given us the astounding overall gains in FuzzyQuery.

Thanks to these enormous performance improvements, Robert has created an entirely new automaton spell checker that uses this same algorithm to find candidate terms for respelling. This is just like FuzzyQuery, except it doesn't visit the matching documents. This is a big improvement over the existing spellchecker as it does not require a separate spellchecker index be maintained.

This whole exciting experience is a great example of why open-source development works so well. Here we have diverse committers from Lucene/Solr, bringing together their various unusual strengths (automatons, Unicode, Python, etc.) to bear on an insanely hard challenge, leveraging other potent open-source packages including Moman and Brics, iterating with the authors of these packages to resolve bugs. No single person involved in this really understands all of the parts; it's truly a team effort.

And now you know what's going on under the hood when you see incredible speedups with FuzzyQuery in 4.0!

[For the not-faint-of-heart, you can browse LUCENE-1606 to see parts of this story unfolding through Jira]

61 comments:

  1. Nice post! Interesting to see this story put to print.

    I'd only make one correction to the history line:

    We did finally loosely crack the paper during those late night sessions - allowing Robert to do a test impl for n=1. There was still some leaps to be made to generalize those steps for n > 1.

    Thankfully you guys came up with the brilliant idea of using Moman for java code generation instead...

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. After reading this post I am still not understanding any of this code :-) But it works!

    Really phantastic! I was only trying to help out with beer for Robert and some tiny improvements to speed up sorting in Automaton or generics police the code.

    It was funny to listen to the discussions between Mike and Robert on the party at Mike's house!

    ReplyDelete
  4. Hi Mark,

    Thanks for the correction! I reworded that part...

    ReplyDelete
  5. Hi Mike:

    Super awesome post!

    Is the built automaton built per segment? If so, does it handle merge from automatons from different segments?

    Thanks

    -John

    ReplyDelete
  6. Fantastic post Mike! It's really nice to hear about the behind-the-scenes story of how something in the research literature makes its way into Lucene and all the hard work you guys do to make it happen. I'm really looking forward to 4.0!

    Tom Burton-West

    ReplyDelete
  7. Hi John,

    The Levenshtein automaton is actually built once up front, and then "intersected" with each segment's terms, segment by segment. We have a single PQ that's used to merge the terms from each segment, then at the end we take the top terms from this PQ and create the real query (this is in TopTermsScoringBooleanQueryRewrite, in MultiTermQuery).

    ReplyDelete
  8. Thanks Tom! It was really quite crazy while this story was unfolding... it wasn't at all clear it was going to have a happy ending!

    ReplyDelete
  9. Great post, Mike!

    You'll find Robert's presentation about Finate State Queries in Lucene at http://www.slideshare.net/otisg

    ReplyDelete
  10. Thanks Otis; I added a link to Robert's slides.

    ReplyDelete
  11. Great post! I'm glad that my code could help such a great project. I'm always happy to help.

    ReplyDelete
  12. Jean-Phillipe,

    Thank you for creating Moman and open-sourcing it (with a friendly license) in the first place! Without that, I don't know how we would have made it through here...

    Mike

    ReplyDelete
  13. You guys are awesome!

    ReplyDelete
  14. I introduced this post to Japanese people.

    http://d.hatena.ne.jp/nokuno/20110329/1301349835

    Thank you for writing great story about Lucene.

    ReplyDelete
  15. Great post. It reads like a thriller!

    The way Python code was written to tap into Moman to auto-generate java code that is used by other java code does sound scary. Why not just run Moman on Jython?

    Also can you briefly explain why N=3 is not practical?

    Thanks.

    ReplyDelete
  16. Hi Andy,

    Moman on Jython would work, but, Lucene is all Java today, so we wanted to keep that.

    N=3 is possible, but it produces biggish tables (like a few hundred KB increase in Lucene's JAR, from the packed tables we store in the generated Java code). Further, since the space of terms accepted by N=3 is so large, it's going to result in much more seeking/scanning to intersect with the terms dictionary, so it could be slowish. It'd be fairly easy to test this...

    ReplyDelete
  17. This is code and fix in perfection :-D

    ReplyDelete
  18. Is there some reason you didn't contact the authors of the paper with your questions? Seems like they would have been happy to help, or at least put you in touch with a graduate student who could translate the paper for you.

    ReplyDelete
  19. It is scary! So basically you guys magically have what you need by translating someone's code without neither understanding of the method and the original code?

    ReplyDelete
  20. Anonymous #1,

    Well, we did really have a basic grasp of the paper, but translating that theory into actual Java code, in practice, was tricky.

    ReplyDelete
  21. Anonymous #2,

    I over-dramatized things somewhat... we do have enough of an understanding to believe it's correct. Furthermore, the tests we created are strenuous, and are exhaustive in that we test the different possible characteristic vectors, so if the paper is correct, the algorithm we implemented should be as well (famous last words I know...).

    ReplyDelete
  22. Are you allowed to simply copy rewrite the Python code into Java and contribute it to the Apache Software Foundation like this? Is this what you did?

    If so, even though the Python code has an MIT License, the work is still copyrighted by the original author and how you have done this seems messy from an intellectual property point of view.

    Generally speaking, you can not simply rewrite someone else's work even though it has an MIT license and contribute it to the Apache Software Foundation.

    ReplyDelete
  23. Anonymous,

    That's a great question. Licensing and open-source are a very complex (and obviously important) topic. I am not a lawyer.... but, here's how the licensing/copyright worked in this case:

    First, Moman's license is the MIT/X11 license, which Apache projects are free to incorporate in source and binary form (see http://www.apache.org/legal/3party.html) as this license does not remove any rights from ASL2.

    Second, the Moman package is being used only as a library by the code generator we (Lucene developers) wrote; none of Moman's sources are incorporated into Lucene (only the generated Java code as a result of running the generator, and our Python code generator). In theory, Moman's license has no bearing (much like the license of a compiler doesn't apply to the binaries it produces).

    But, finally, to be sure, we contacted Jean-Phillipe to confirm he was the sole author of this package, and he was OK with our usage of it (he was), and we've also added Moman's copyright and license in Lucene's LICENSE.txt file.

    ReplyDelete
  24. Excellent work. A nice demonstration of how university research can help solve read world problems.

    ReplyDelete
  25. There's also a native Java implementation of the Mitankin paper.

    http://www.infiauto.com/projects/datastr/javadoc/

    ReplyDelete
  26. Mike,
    Could you help me to understand how pre-building a Levenshtein Automaton (for terms?) can avoid scanning all terms on run time?
    Kind regards,
    Youbin Peng

    ReplyDelete
  27. Pre-building the Levenshtein automaton changes the problem from "test every term" to a graph intersection problem, ie, we intersect the graph (Levenshtein automaton) with the terms in the terms dictionary by using seek/next APIs already available in Lucene.

    In fact, at some point we should add an intersect() method directly into Lucene's APIs, because some terms dict representations can potentially do this more efficiently than seek/next.

    ReplyDelete
  28. As I can not fully understand your above explaination, I have the following questions which might be stupid.
    1. Suppose we have n terms in the terms dictionary , do we have one graph or n graphes?
    2. Why using seek/next APIs can skip some terms?
    I do think that we have to calculate the intersection for all terms.

    ReplyDelete
  29. You have 1 graph created, from the fuzzy query's term plus edit distance N. This graph is the Levenshtein automaton.

    You then intersect that graph with the terms in the index, by iteratively seeking to the next possible match. This seek is able to skip terms because chunks of the terms space cannot possibly match.

    For example if your query is foobar~1 (ie, edit distance 1), and you are at terms starting with foba, you know the only possible match with that prefix is fobar so you seek there, possible skipping fobaa, fobap, fobaq, etc.

    ReplyDelete
  30. Great post! And would highly appreciate it if u can give some concrete benchmark results.

    ReplyDelete
  31. Je suis récemment tombé sur votre blog et ont lu le long. Je pensais que je quitterais mon premier commentaire. Je ne sais pas quoi dire sauf que j'ai apprécié la lecture. Blog de ​​Nice, je vais continuer à visiter ce blog très souvent.

    ReplyDelete
  32. Interesting, is it worth to change an existing code with FuzzyQuery? i have no big lag with the current one, but....

    ReplyDelete
  33. Great post Mike, really interesting insight into upcoming Lucene 4.0 Fuzzy search. Looking forward for 4.0 !

    ReplyDelete
  34. I'm really glad to see someone picked up the paper from Klaus Schulz and Stoyan Mihow.
    When I first read it in 2007 or 2008 I didn't understand much more of it than it's potential.
    Thanks for your efforts coding it within lucene - I would never be capable to do so.
    I'm really waiting to see multiple token fuzzy matches in Action!

    ReplyDelete
  35. I just stumbled upon this very interesting blog post and have a question: doesn't the complexity of the generation of the Levenshtein automaton in terms of time and storage depend on the used alphabet?

    I mean, generation of the automaton for the Latin alphabet with its 26 letters may be easy and efficient, but what about Unicode? Having an alphabet with hundreds of thousands of letters must complicate the situation dramatically, no?

    ReplyDelete
  36. Hi Mathias,

    Lucene generates the Levenshtein automaton with full Unicode alphabet
    and then we convert that to the equivalent automaton with UTF8 labels
    (since terms are stored UTF8 byte[] in the index, by default). I
    think this conversion is logically a composition of the FSA with an
    FST (but it's not implemented "generically").

    This means the edit distance is defined with Unicode characters. So
    an edit distance of 2 means you can drop/insert/replace/transpose any 2 full Unicode
    chars. This is a difference vs previous versions of Lucene which measure
    edit distance in UTF16 code units.

    I don't think the cost of building the Levenshtein automaton increases
    much with increased alphabet size, as long as your automaton
    representation can efficiently handle arcs that match a range of
    characters (ours, poached from http://www.brics.dk/automaton, does).

    ReplyDelete
  37. > Lucene generates the Levenshtein automaton with full Unicode

    Keep in mind Unicode is a living standard and the tables should be reviewed yearly.

    ReplyDelete
  38. Hi Anonymous,

    That's fine: how the Unicode Consortium assigns Unicode characters won't affect FuzzyQuery's implementations. The tables we use to generate the Levenshtein automata are agnostic to the binding of each character: they simply accept any int up to Character.MAX_CODE_POINT.

    ReplyDelete
  39. I wonder how fast is 100 times faster? The example of database and query with search time would be useful. I know it can be done in Java pretty fast, like this one:
    http://www.softcorporation.com/products/people/index.html

    ReplyDelete
  40. Hi Anonymous,

    It's quite fast now ... you see see the nightly benchmarks (http://people.apache.org/~mikemccand/lucenebench/ ) ... ~ 30-40 QPS on the full Wikipedia index.

    We also now have a DirectSpellChecker that runs a FuzzyQuery minus visiting the matching documents. This is nice because it avoids requiring the "sidecar" spellcheck index that's necessary with the old spellchecker.

    ReplyDelete
  41. Hi Mike,

    Its really a nice post.

    I have a confusion here regarding Fuzzy Query. Since Solr 4 is supporting fuzzy searches using Edit Distance which needs a parameter i.e. N which can have values as 0 or 1 or 2(max). So, why are the values like 0.4,0.6..till 1 are still supported and 1.5,2.2.. are not ? How does it makes sense ? Is it just for backward compatibility or there is something that I am missing ?

    Thanks in advance.

    ReplyDelete
    Replies
    1. Hi Anonymous,

      I believe a value > 1 is supposed to be an integer edit distance, while a value <= 1 is allowed to be the legacy similarity measure (which under the hood is changed to an edit distance based on the length of the query term). But maybe ask this question on java-user@lucene.apache.org to be sure!

      Delete
    2. Hi Mike,

      Thanks for the reply.
      I have asked question to java-user@lucene.apache.org.

      As per your reply, what I understand is, the values between 0 to 1 are still allowed so as not to make changes in the way lucene is queried in case of fuzzy matching. And the values like 1 or 2 are supported to provide a parameter for fuzzy search explicitly(where we can specify the number that signifies the edit distance between source and target strings , however value between 0 to 1 also does the same thing but with some internal calculation).

      Please correct my understanding if wrong.


      Delete
    3. Hi Anonymous,

      I couldn't find your email to java-user ... what was the subject line?

      Your understanding is exactly right!

      Delete
  42. Hello Mike,

    I am using Solr 4.2.1.
    My question is :

    Is there a way where I can combine fuzzy search with phonetic searches. Actually, I want to search on the fields like first_name, last_name and so on to get the records that can have some spelling mistakes as well.(Spell suggester is not fit for me as I want to get the solr documents in output not the list of words/strings)

    Only fuzzy search is not fit for me as I can at max provide ~2 (edit distance) as fuzziness factor in query and only Phonetic will also not work as there are some words for which encoding in DoubleMetaphone completely changes with the change in a single character.

    And also I came to know that with fuzzy search all the query time analysis is by passed. So I am unable to find a way to have both together.

    One way that I just found is to have to fields one analyzed with phonetic filters and one as text. Then I could query them as (firstname_text:hello~2 AND firstname_phonetic:hello)

    If there is no such way to have both together then is the approach I have in mind is correct or not ?

    Waiting for your reply.
    Thanks in advance.

    ReplyDelete
    Replies
    1. This is certainly possible with Lucene: just analyzing the field with a phonetic analyzer, do the same at query time, and create a FuzzyQuery from that. But it sounds like you need to know how to do this with Solr? I'm not sure ... did you already email the solr-user@lucene.apache.org list?

      Delete
    2. Hey Michael,
      first of all thanks for this awesome blog!
      I'm currently facing the exact problem as described here. Now I want to analyze the input at query time but I'm not quite sure how to do so. Do I analyze a single String and return the analyzed version?

      Thanks in advance.

      Delete
    3. Hi Anonymous,

      Maybe send an email to java-user@lucene.apache.org? Or, solr-user.

      Delete
  43. This comment has been removed by the author.

    ReplyDelete
  44. Michael,
    Thanks for the blog -- I had a clarification re the nature of improvement:

    In the old Fuzzysearch, the system examined every 'plausible' candidate and computed the actual Levenshtein distance -- an expensive computation for each candidate to decide whether it was within N. Whereas in this new one, it uses a new data structure (the autoaton/table built specifically for the current query) to check whether every 'plausible' candidate is within a distance of N.

    Is that a correct understanding?

    Thx

    ReplyDelete
    Replies
    1. Hi learningbyprogramming,

      That's correct, except with the approach in 4.x, since we pre-compile the space of all terms within edit distance N of the target term into an automaton up front, visiting the matching terms is a much, much faster process (just an intersection of that automaton with the tree structure of the terms in the terms dictionary).

      Delete
    2. Hi Michael,

      Many thanks for your response. However I didnt understand what you mean by pre-compile since the target is a runtime query.

      Regds

      Delete
    3. What I meant by pre-compile is, for each query, we build the automaton accepting all terms within edit distance N of the query, up front. After that it's a fast intersection with all terms.

      Delete
  45. Hi,

    iam working on OpenNLP with SOLR. I have successfully applied the patch LUCENE-2899-x.patch to latest SOLR code branch_4x.
    I desgined some analyers based on OpenNLP filters and tokenziers and index some documnets on that fields.
    Searching on OpenNLP field is not constant. Not able to search on these OpenNLP designed fields in solr schema.xml properly.
    Also, how to use payloads for boosting the document.

    Please help me on this.

    ReplyDelete
    Replies
    1. Hi, you should ask on the Solr user's list (solr-user@lucene.apache.org).

      Delete
  46. Hi Michael, what we can do with FSA is really impressive :)

    I was now thinking to the application to an Autocomplete feature :
    Currently there are 2 different approaches for Autocomplete :
    1) Using the term enum and filtering based on a byte prefix on each instance of term enum ( which is a ByteRef)
    2) Using the suggester (org.apache.solr.spelling.suggest.Suggester
    org.apache.solr.spelling.suggest.fst.FSTLookup)

    The second approach should be very similar to the SpellCheck FSA approach.
    So it's faster to use (2)FST or the (1) Byte prefix filter ?

    From the point of view of memory?

    Cheers

    ReplyDelete
    Replies
    1. I think suggesting via TermsEnum is typically slow; most of our prefix-based suggesters use FST since it's so efficient at doing a prefix lookup followed by "top N paths by score" search.

      Even once you add fuzziness (LevN automaton) the FST based suggester is still quite fast; I forget the numbers but it's on the fuzzy suggester Jira issue ...

      Delete
  47. Thank you very much, I don't think number are important in this moment for me, it's nice to know that the FST one is faster !

    ReplyDelete