Changing Bits: Lucene's FuzzyQuery is 100 times faster in 4.0

Thursday, March 24, 2011

Lucene's FuzzyQuery is 100 times faster in 4.0

There are many exciting improvements in Lucene's eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only from its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be.

FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.

Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.

The journey begins

The long journey began when Robert Muir had the idea of pre-building a Levenshtein Automaton, a deterministic automaton (DFA) that accepts only the terms within edit distance N. Doing this, up front, and then intersecting that automaton with the terms in the index, should give a massive speedup, he reasoned.

At first he built a simple prototype, explicitly unioning the separate DFAs that allow for up to N insertions, deletions and substitutions. But, unfortunately, just building that DFA (let alone then intersecting it with the terms in the index), was too slow.

Fortunately, after some Googling, he discovered a paper, by Klaus Schulz and Stoyan Mihov (now famous among the Lucene/Solr committers!) detailing an efficient algorithm for building the Levenshtein Automaton from a given base term and max edit distance. All he had to do is code it up! It's just software after all. Somehow, he roped Mark Miller, another Lucene/Solr committer, into helping him do this.

Unfortunately, the paper was nearly unintelligible! It's 67 pages, filled with all sorts of equations, Greek symbols, definitions, propositions, lemmas, proofs. It uses scary concepts like Subsumption Triangles, along with beautiful yet still unintelligible diagrams. Really the paper may as well have been written in Latin.

Much coffee and beer was consumed, sometimes simultaneously. Many hours were spent on IRC, staying up all night, with Mark and Robert carrying on long conversations, which none of the rest of us could understand, trying desperately to decode the paper and turn it into Java code. Weeks went by like this and they actually had made some good initial progress, managing to loosely crack the paper to the point where they had a test implementation of the N=1 case, and it seemed to work. But generalizing that to the N=2 case was... daunting.

The breakthrough

Then, finally, a breakthrough! Robert found, after even more Googling, an existence proof, in an unexpected place: an open-source package, Moman, under the generous MIT license. The author, Jean-Phillipe Barrette-LaPierre, had somehow, incredibly, magically, quietly, implemented the algorithm from this paper. And this was apparently a random side project for him, unrelated to his day job. So now we knew it was possible (and we all have deep admiration for Jean-Phillipe!).

We decided to simply re-use Moman's implementation to accomplish our goals. But, it turns out, its source code is all Python (my favorite programming language)! And, nearly as hairy as the paper itself. Nevertheless, we pushed on.

Not really understanding the Python code, and also neither the paper, we desperately tried to write our own Python code to tap into the various functions embedded in Moman's code, to auto-generate Java code containing the necessary tables for each max edit distance case (N=1, N=2, etc.). We had to guess what each Python function did, by its name, trying to roughly match this up to the spooky terminology in the paper.

The result was createLevAutomata.py: it auto-generates crazy looking Java code (see Lev2ParametricDescription.java, and scroll to the cryptic packed tables at the bottom), which in turn is used by further Java code to create the Levenshtein automaton per-query. We only generate the N=1 and N=2 cases (the N>=3 cases aren't really practical, at least not yet).

The last bug...

Realize, now, what a crazy position we were in. We wrote our own scary Python code, tapping into various functions in the Moman package, to auto-generate unreadable Java code with big tables of numbers, which is then used to generate Levenshtein automata from the base term and N. We went through many iterations with this crazy chain of Python and Java code that we barely understood, slowly iterating to get the bugs out.

After fixing many problems, we still had one persistent bug which we just couldn't understand, let alone fix. We struggled for several days, assuming the bug was in our crazy Python/Java chain. Finally, we considered the possibility that the bug was in Moman, and indeed Robert managed to reduce the problem to a tiny Python-only case showing where Moman failed to match the right terms. Robert sent this example to Jean-Phillipe, who quickly confirmed the bug and posted a patch the next day. We applied his patch and suddenly everything was working perfectly!

Fortunately, while this fast FuzzyQuery was unbelievably hairy to implement, testing it well is relatively easy since we can validate it against the brute-force enumeration from 3.0. We have several tests verifying the different layers executed by the full FuzzyQuery. The tests are exhaustive in that they test all structurally different cases possible in the Levenshtein construction, using a binary (only characters 0 and 1) terms.

Beyond just solving this nearly impossible task of efficiently compiling a term to a Levenshtein Automaton, we had many other parts to fill in. For example, Robert separately created a general AutomatonQuery, re-using infrastructure from the open-source Brics automaton package, to enable fast intersection of an automaton against all terms and documents in the index. This query is now used to handle WildcardQuery, RegexpQuery, and FuzzyQuery. It's also useful for custom cases, too; for example it's used by Solr to reverse wildcard queries. These slides from Robert describe AutomatonQuery, and its fun possible use case, in more detail.

Separately, we had an impedance mismatch: these automatons speak full unicode (UTF32) characters, yet Lucene's terms are stored in UTF8 bytes, so we had to create a UTF32 -> UTF8 automaton converter, which by itself was also very hairy! That converter translates any UTF32 automaton into an equivalent UTF8 Levenshtein automaton, which can be directly intersected against the terms in the index.

So, today, when you run a FuzzyQuery in 4.0, it efficiently seeks and scans only those regions of the term space which may have matches, guided by the Levenshtein automaton. This, coupled with ongoing performance improvements to seeking and scanning terms, as well as other major improvements like performing MultiTermQuery rewrites per-segment, has given us the astounding overall gains in FuzzyQuery.

Thanks to these enormous performance improvements, Robert has created an entirely new automaton spell checker that uses this same algorithm to find candidate terms for respelling. This is just like FuzzyQuery, except it doesn't visit the matching documents. This is a big improvement over the existing spellchecker as it does not require a separate spellchecker index be maintained.

This whole exciting experience is a great example of why open-source development works so well. Here we have diverse committers from Lucene/Solr, bringing together their various unusual strengths (automatons, Unicode, Python, etc.) to bear on an insanely hard challenge, leveraging other potent open-source packages including Moman and Brics, iterating with the authors of these packages to resolve bugs. No single person involved in this really understands all of the parts; it's truly a team effort.

And now you know what's going on under the hood when you see incredible speedups with FuzzyQuery in 4.0!

[For the not-faint-of-heart, you can browse LUCENE-1606 to see parts of this story unfolding through Jira]

65 comments:

MarkMarch 24, 2011 at 6:10 PM
Nice post! Interesting to see this story put to print.

I'd only make one correction to the history line:

We did finally loosely crack the paper during those late night sessions - allowing Robert to do a test impl for n=1. There was still some leaps to be made to generalize those steps for n > 1.

Thankfully you guys came up with the brilliant idea of using Moman for java code generation instead...
ReplyDelete
Replies
Uwe SchindlerMarch 24, 2011 at 7:22 PM
This comment has been removed by the author.
ReplyDelete
Replies
Uwe SchindlerMarch 24, 2011 at 7:26 PM
After reading this post I am still not understanding any of this code :-) But it works!

Really phantastic! I was only trying to help out with beer for Robert and some tiny improvements to speed up sorting in Automaton or generics police the code.

It was funny to listen to the discussions between Mike and Robert on the party at Mike's house!
ReplyDelete
Replies
Michael McCandlessMarch 25, 2011 at 6:52 AM
Hi Mark,

Thanks for the correction! I reworded that part...
ReplyDelete
Replies
John WangMarch 25, 2011 at 12:33 PM
Hi Mike:

Super awesome post!

Is the built automaton built per segment? If so, does it handle merge from automatons from different segments?

Thanks

-John
ReplyDelete
Replies
UnknownMarch 25, 2011 at 2:57 PM
Fantastic post Mike! It's really nice to hear about the behind-the-scenes story of how something in the research literature makes its way into Lucene and all the hard work you guys do to make it happen. I'm really looking forward to 4.0!

Tom Burton-West
ReplyDelete
Replies
Michael McCandlessMarch 25, 2011 at 4:12 PM
Hi John,

The Levenshtein automaton is actually built once up front, and then "intersected" with each segment's terms, segment by segment. We have a single PQ that's used to merge the terms from each segment, then at the end we take the top terms from this PQ and create the real query (this is in TopTermsScoringBooleanQueryRewrite, in MultiTermQuery).
ReplyDelete
Replies
Michael McCandlessMarch 25, 2011 at 4:13 PM
Thanks Tom! It was really quite crazy while this story was unfolding... it wasn't at all clear it was going to have a happy ending!
ReplyDelete
Replies
AnonymousMarch 25, 2011 at 11:30 PM
Great post, Mike!

You'll find Robert's presentation about Finate State Queries in Lucene at http://www.slideshare.net/otisg
ReplyDelete
Replies
Michael McCandlessMarch 26, 2011 at 6:05 AM
Thanks Otis; I added a link to Robert's slides.
ReplyDelete
Replies
Jean-Philippe Barrette-LaPierreApril 2, 2011 at 9:43 AM
Great post! I'm glad that my code could help such a great project. I'm always happy to help.
ReplyDelete
Replies
Michael McCandlessApril 2, 2011 at 11:20 AM
Jean-Phillipe,

Thank you for creating Moman and open-sourcing it (with a friendly license) in the first place! Without that, I don't know how we would have made it through here...

Mike
ReplyDelete
Replies
Todd HoffApril 17, 2011 at 10:56 AM
What a great story!
ReplyDelete
Replies
AnonymousApril 18, 2011 at 6:48 AM
You guys are awesome!
ReplyDelete
Replies
AnonymousApril 20, 2011 at 10:45 AM
I introduced this post to Japanese people.

http://d.hatena.ne.jp/nokuno/20110329/1301349835

Thank you for writing great story about Lucene.
ReplyDelete
Replies
AndyApril 22, 2011 at 7:37 PM
Great post. It reads like a thriller!

The way Python code was written to tap into Moman to auto-generate java code that is used by other java code does sound scary. Why not just run Moman on Jython?

Also can you briefly explain why N=3 is not practical?

Thanks.
ReplyDelete
Replies
MichaelApril 23, 2011 at 6:28 PM
Thanks nokuno!
ReplyDelete
Replies
Michael McCandlessApril 23, 2011 at 6:43 PM
Hi Andy,

Moman on Jython would work, but, Lucene is all Java today, so we wanted to keep that.

N=3 is possible, but it produces biggish tables (like a few hundred KB increase in Lucene's JAR, from the packed tables we store in the generated Java code). Further, since the space of terms accepted by N=3 is so large, it's going to result in much more seeking/scanning to intersect with the terms dictionary, so it could be slowish. It'd be fairly easy to test this...
ReplyDelete
Replies
AlexApril 24, 2011 at 11:20 AM
This is code and fix in perfection :-D
ReplyDelete
Replies
AnonymousApril 27, 2011 at 8:27 AM
Is there some reason you didn't contact the authors of the paper with your questions? Seems like they would have been happy to help, or at least put you in touch with a graduate student who could translate the paper for you.
ReplyDelete
Replies
AnonymousApril 27, 2011 at 10:11 AM
It is scary! So basically you guys magically have what you need by translating someone's code without neither understanding of the method and the original code?
ReplyDelete
Replies
Michael McCandlessApril 27, 2011 at 1:32 PM
Anonymous #1,

Well, we did really have a basic grasp of the paper, but translating that theory into actual Java code, in practice, was tricky.
ReplyDelete
Replies
Michael McCandlessApril 27, 2011 at 1:36 PM
Anonymous #2,

I over-dramatized things somewhat... we do have enough of an understanding to believe it's correct. Furthermore, the tests we created are strenuous, and are exhaustive in that we test the different possible characteristic vectors, so if the paper is correct, the algorithm we implemented should be as well (famous last words I know...).
ReplyDelete
Replies
AnonymousApril 28, 2011 at 1:37 PM
Are you allowed to simply copy rewrite the Python code into Java and contribute it to the Apache Software Foundation like this? Is this what you did?

If so, even though the Python code has an MIT License, the work is still copyrighted by the original author and how you have done this seems messy from an intellectual property point of view.

Generally speaking, you can not simply rewrite someone else's work even though it has an MIT license and contribute it to the Apache Software Foundation.
ReplyDelete
Replies
Michael McCandlessApril 29, 2011 at 4:56 AM
Anonymous,

That's a great question. Licensing and open-source are a very complex (and obviously important) topic. I am not a lawyer.... but, here's how the licensing/copyright worked in this case:

First, Moman's license is the MIT/X11 license, which Apache projects are free to incorporate in source and binary form (see http://www.apache.org/legal/3party.html) as this license does not remove any rights from ASL2.

Second, the Moman package is being used only as a library by the code generator we (Lucene developers) wrote; none of Moman's sources are incorporated into Lucene (only the generated Java code as a result of running the generator, and our Python code generator). In theory, Moman's license has no bearing (much like the license of a compiler doesn't apply to the binaries it produces).

But, finally, to be sure, we contacted Jean-Phillipe to confirm he was the sole author of this package, and he was OK with our usage of it (he was), and we've also added Moman's copyright and license in Lucene's LICENSE.txt file.
ReplyDelete
Replies
AnonymousMay 2, 2011 at 10:45 AM
Excellent work. A nice demonstration of how university research can help solve read world problems.
ReplyDelete
Replies
courtJune 19, 2011 at 9:10 PM
There's also a native Java implementation of the Mitankin paper.

http://www.infiauto.com/projects/datastr/javadoc/
ReplyDelete
Replies
AnonymousJune 29, 2011 at 5:00 AM
Mike,
Could you help me to understand how pre-building a Levenshtein Automaton (for terms?) can avoid scanning all terms on run time?
Kind regards,
Youbin Peng
ReplyDelete
Replies
Michael McCandlessJune 29, 2011 at 6:41 AM
Pre-building the Levenshtein automaton changes the problem from "test every term" to a graph intersection problem, ie, we intersect the graph (Levenshtein automaton) with the terms in the terms dictionary by using seek/next APIs already available in Lucene.

In fact, at some point we should add an intersect() method directly into Lucene's APIs, because some terms dict representations can potentially do this more efficiently than seek/next.
ReplyDelete
Replies
AnonymousJune 29, 2011 at 8:10 AM
As I can not fully understand your above explaination, I have the following questions which might be stupid.
1. Suppose we have n terms in the terms dictionary , do we have one graph or n graphes?
2. Why using seek/next APIs can skip some terms?
I do think that we have to calculate the intersection for all terms.
ReplyDelete
Replies
Michael McCandlessJune 29, 2011 at 1:05 PM
You have 1 graph created, from the fuzzy query's term plus edit distance N. This graph is the Levenshtein automaton.

You then intersect that graph with the terms in the index, by iteratively seeking to the next possible match. This seek is able to skip terms because chunks of the terms space cannot possibly match.

For example if your query is foobar~1 (ie, edit distance 1), and you are at terms starting with foba, you know the only possible match with that prefix is fobar so you seek there, possible skipping fobaa, fobap, fobaq, etc.
ReplyDelete
Replies
AnonymousAugust 8, 2011 at 6:27 PM
Great post! And would highly appreciate it if u can give some concrete benchmark results.
ReplyDelete
Replies
casino en ligneNovember 16, 2011 at 7:42 AM
Je suis récemment tombé sur votre blog et ont lu le long. Je pensais que je quitterais mon premier commentaire. Je ne sais pas quoi dire sauf que j'ai apprécié la lecture. Blog de Nice, je vais continuer à visiter ce blog très souvent.
ReplyDelete
Replies
paris sportifDecember 27, 2011 at 4:42 AM
Interesting, is it worth to change an existing code with FuzzyQuery? i have no big lag with the current one, but....
ReplyDelete
Replies
UnknownJanuary 2, 2012 at 3:03 PM
Great post Mike, really interesting insight into upcoming Lucene 4.0 Fuzzy search. Looking forward for 4.0 !
ReplyDelete
Replies
AnonymousJune 5, 2012 at 8:24 AM
I'm really glad to see someone picked up the paper from Klaus Schulz and Stoyan Mihow.
When I first read it in 2007 or 2008 I didn't understand much more of it than it's potential.
Thanks for your efforts coding it within lucene - I would never be capable to do so.
I'm really waiting to see multiple token fuzzy matches in Action!
ReplyDelete
Replies
Mathias KunterAugust 22, 2012 at 12:02 PM
I just stumbled upon this very interesting blog post and have a question: doesn't the complexity of the generation of the Levenshtein automaton in terms of time and storage depend on the used alphabet?

I mean, generation of the automaton for the Latin alphabet with its 26 letters may be easy and efficient, but what about Unicode? Having an alphabet with hundreds of thousands of letters must complicate the situation dramatically, no?
ReplyDelete
Replies
Michael McCandlessAugust 22, 2012 at 12:38 PM
Hi Mathias,

Lucene generates the Levenshtein automaton with full Unicode alphabet
and then we convert that to the equivalent automaton with UTF8 labels
(since terms are stored UTF8 byte[] in the index, by default). I
think this conversion is logically a composition of the FSA with an
FST (but it's not implemented "generically").

This means the edit distance is defined with Unicode characters. So
an edit distance of 2 means you can drop/insert/replace/transpose any 2 full Unicode
chars. This is a difference vs previous versions of Lucene which measure
edit distance in UTF16 code units.

I don't think the cost of building the Levenshtein automaton increases
much with increased alphabet size, as long as your automaton
representation can efficiently handle arcs that match a range of
characters (ours, poached from http://www.brics.dk/automaton, does).
ReplyDelete
Replies
AnonymousSeptember 21, 2012 at 4:21 PM
> Lucene generates the Levenshtein automaton with full Unicode

Keep in mind Unicode is a living standard and the tables should be reviewed yearly.
ReplyDelete
Replies
Michael McCandlessSeptember 22, 2012 at 8:25 AM
Hi Anonymous,

That's fine: how the Unicode Consortium assigns Unicode characters won't affect FuzzyQuery's implementations. The tables we use to generate the Levenshtein automata are agnostic to the binding of each character: they simply accept any int up to Character.MAX_CODE_POINT.
ReplyDelete
Replies
AnonymousOctober 24, 2012 at 10:15 PM
I wonder how fast is 100 times faster? The example of database and query with search time would be useful. I know it can be done in Java pretty fast, like this one:
http://www.softcorporation.com/products/people/index.html
ReplyDelete
Replies
Michael McCandlessOctober 25, 2012 at 7:14 AM
Hi Anonymous,

It's quite fast now ... you see see the nightly benchmarks (http://people.apache.org/~mikemccand/lucenebench/ ) ... ~ 30-40 QPS on the full Wikipedia index.

We also now have a DirectSpellChecker that runs a FuzzyQuery minus visiting the matching documents. This is nice because it avoids requiring the "sidecar" spellcheck index that's necessary with the old spellchecker.
ReplyDelete
Replies
AnonymousJune 4, 2013 at 3:17 AM
Hi Mike,

Its really a nice post.

I have a confusion here regarding Fuzzy Query. Since Solr 4 is supporting fuzzy searches using Edit Distance which needs a parameter i.e. N which can have values as 0 or 1 or 2(max). So, why are the values like 0.4,0.6..till 1 are still supported and 1.5,2.2.. are not ? How does it makes sense ? Is it just for backward compatibility or there is something that I am missing ?

Thanks in advance.
ReplyDelete
Replies
AnonymousJune 5, 2013 at 5:32 AM
Hello Mike,

I am using Solr 4.2.1.
My question is :

Is there a way where I can combine fuzzy search with phonetic searches. Actually, I want to search on the fields like first_name, last_name and so on to get the records that can have some spelling mistakes as well.(Spell suggester is not fit for me as I want to get the solr documents in output not the list of words/strings)

Only fuzzy search is not fit for me as I can at max provide ~2 (edit distance) as fuzziness factor in query and only Phonetic will also not work as there are some words for which encoding in DoubleMetaphone completely changes with the change in a single character.

And also I came to know that with fuzzy search all the query time analysis is by passed. So I am unable to find a way to have both together.

One way that I just found is to have to fields one analyzed with phonetic filters and one as text. Then I could query them as (firstname_text:hello~2 AND firstname_phonetic:hello)

If there is no such way to have both together then is the approach I have in mind is correct or not ?

Waiting for your reply.
Thanks in advance.
ReplyDelete
Replies
learningbyprogrammingAugust 29, 2013 at 6:03 AM
This comment has been removed by the author.
ReplyDelete
Replies
learningbyprogrammingAugust 29, 2013 at 6:56 AM
Michael,
Thanks for the blog -- I had a clarification re the nature of improvement:

In the old Fuzzysearch, the system examined every 'plausible' candidate and computed the actual Levenshtein distance -- an expensive computation for each candidate to decide whether it was within N. Whereas in this new one, it uses a new data structure (the autoaton/table built specifically for the current query) to check whether every 'plausible' candidate is within a distance of N.

Is that a correct understanding?

Thx
ReplyDelete
Replies
AnonymousSeptember 23, 2013 at 5:05 AM
Hi,

iam working on OpenNLP with SOLR. I have successfully applied the patch LUCENE-2899-x.patch to latest SOLR code branch_4x.
I desgined some analyers based on OpenNLP filters and tokenziers and index some documnets on that fields.
Searching on OpenNLP field is not constant. Not able to search on these OpenNLP designed fields in solr schema.xml properly.
Also, how to use payloads for boosting the document.

Please help me on this.
ReplyDelete
Replies
UnknownOctober 1, 2013 at 9:46 AM
Hi Michael, what we can do with FSA is really impressive :)

I was now thinking to the application to an Autocomplete feature :
Currently there are 2 different approaches for Autocomplete :
1) Using the term enum and filtering based on a byte prefix on each instance of term enum ( which is a ByteRef)
2) Using the suggester (org.apache.solr.spelling.suggest.Suggester
org.apache.solr.spelling.suggest.fst.FSTLookup)

The second approach should be very similar to the SpellCheck FSA approach.
So it's faster to use (2)FST or the (1) Byte prefix filter ?

From the point of view of memory?

Cheers
ReplyDelete
Replies
UnknownOctober 3, 2013 at 12:51 PM
Thank you very much, I don't think number are important in this moment for me, it's nice to know that the FST one is faster !
ReplyDelete
Replies
UnknownOctober 19, 2016 at 3:52 AM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownDecember 15, 2017 at 8:12 AM
nice post
ReplyDelete
Replies

Add comment