There are many exciting improvements in Lucene's eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only for its incredible gains but also for the amazing behind-the-scenes story of how it all came to be.
FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.
The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (in older releases N was a confusing float between 0.0 and 1.0, which was translated to an equivalent max edit distance through a tricky formula).
FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.
Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.
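To make that cost concrete, here's a tiny Python sketch of the brute-force approach (not Lucene's actual code; the term list and function names are made up for illustration): a textbook dynamic-programming Levenshtein distance, applied to every single unique term.

def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def brute_force_fuzzy(base, max_edits, all_terms):
    # Visit every unique term in the index; this is the O(number of terms) cost.
    return [t for t in all_terms if levenshtein(base, t) <= max_edits]

terms = ["fuzzy", "lucene", "mcandles", "mccandless", "mkandless"]
print(brute_force_fuzzy("mcandless", 1, terms))
# -> ['mcandles', 'mccandless', 'mkandless']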
The journey begins
The long journey began when Robert Muir had the idea of pre-building a Levenshtein Automaton, a deterministic automaton (DFA) that accepts only the terms within edit distance N. Doing this, up front, and then intersecting that automaton with the terms in the index, should give a massive speedup, he reasoned.
At first he built a simple prototype, explicitly unioning the separate DFAs that allow for up to N insertions, deletions and substitutions. But, unfortunately, just building that DFA (let alone then intersecting it with the terms in the index), was too slow.
Fortunately, after some Googling, he discovered a paper, by Klaus Schulz and Stoyan Mihov (now famous among the Lucene/Solr committers!), detailing an efficient algorithm for building the Levenshtein Automaton from a given base term and max edit distance. All he had to do was code it up! It's just software, after all. Somehow, he roped Mark Miller, another Lucene/Solr committer, into helping him do this.
Unfortunately, the paper was nearly unintelligible! It's 67 pages, filled with all sorts of equations, Greek symbols, definitions, propositions, lemmas, proofs. It uses scary concepts like Subsumption Triangles, along with beautiful yet still unintelligible diagrams. Really the paper may as well have been written in Latin.
Much coffee and beer was consumed, sometimes simultaneously. Many hours were spent on IRC, staying up all night, with Mark and Robert carrying on long conversations, which none of the rest of us could understand, trying desperately to decode the paper and turn it into Java code. Weeks went by like this, and they actually made some good initial progress, managing to loosely crack the paper to the point where they had a test implementation of the N=1 case, and it seemed to work. But generalizing that to the N=2 case was... daunting.
The breakthrough
Then, finally, a breakthrough! Robert found, after even more Googling, an existence proof, in an unexpected place: an open-source package, Moman, under the generous MIT license. The author, Jean-Phillipe Barrette-LaPierre, had somehow, incredibly, magically, quietly, implemented the algorithm from this paper. And this was apparently a random side project for him, unrelated to his day job. So now we knew it was possible (and we all have deep admiration for Jean-Phillipe!).
We decided to simply re-use Moman's implementation to accomplish our goals. But, it turns out, its source code is all Python (my favorite programming language)! And, nearly as hairy as the paper itself. Nevertheless, we pushed on.
Not really understanding the Python code, or the paper for that matter, we desperately tried to write our own Python code to tap into the various functions embedded in Moman's code, to auto-generate Java code containing the necessary tables for each max edit distance case (N=1, N=2, etc.). We had to guess what each Python function did by its name, trying to roughly match this up to the spooky terminology in the paper.
The result was createLevAutomata.py: it auto-generates crazy looking Java code (see Lev2ParametricDescription.java, and scroll to the cryptic packed tables at the bottom), which in turn is used by further Java code to create the Levenshtein automaton per-query. We only generate the N=1 and N=2 cases (the N>=3 cases aren't really practical, at least not yet).
The last bug...
Realize, now, what a crazy position we were in. We wrote our own scary Python code, tapping into various functions in the Moman package, to auto-generate unreadable Java code with big tables of numbers, which is then used to generate Levenshtein automata from the base term and N. We went through many iterations with this crazy chain of Python and Java code that we barely understood, slowly iterating to get the bugs out.
After fixing many problems, we still had one persistent bug which we just couldn't understand, let alone fix. We struggled for several days, assuming the bug was in our crazy Python/Java chain. Finally, we considered the possibility that the bug was in Moman, and indeed Robert managed to reduce the problem to a tiny Python-only case showing where Moman failed to match the right terms. Robert sent this example to Jean-Phillipe, who quickly confirmed the bug and posted a patch the next day. We applied his patch and suddenly everything was working perfectly!
Fortunately, while this fast FuzzyQuery was unbelievably hairy to implement, testing it well is relatively easy since we can validate it against the brute-force enumeration from 3.0. We have several tests verifying the different layers executed by the full FuzzyQuery. The tests are exhaustive in that they test all structurally different cases possible in the Levenshtein construction, using binary (only characters 0 and 1) terms.
Beyond just solving this nearly impossible task of efficiently compiling a term to a Levenshtein Automaton, we had many other parts to fill in. For example, Robert separately created a general AutomatonQuery, re-using infrastructure from the open-source Brics automaton package, to enable fast intersection of an automaton against all terms and documents in the index. This query is now used to handle WildcardQuery, RegexpQuery, and FuzzyQuery. It's also useful for custom cases; for example, it's used by Solr to reverse wildcard queries. These slides from Robert describe AutomatonQuery, and its fun possible use cases, in more detail.
Separately, we had an impedance mismatch: these automatons speak full Unicode (UTF-32) characters, yet Lucene's terms are stored as UTF-8 bytes, so we had to create a UTF-32 -> UTF-8 automaton converter, which by itself was also very hairy! That converter translates any UTF-32 automaton into an equivalent UTF-8 automaton, which can be directly intersected against the terms in the index.
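To see why that mismatch is awkward, here's a small Python illustration (just standard-library encoding, not the converter itself): an automaton over code points uses one transition per character, but the terms in the index are UTF-8 byte sequences, so a single character can expand into a chain of up to four byte-level transitions.

for ch in ["a", "é", "漢", "𝄞"]:
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):06X} -> {len(utf8)} UTF-8 byte(s): {list(utf8)}")
# 1, 2, 3 and 4 byte encodings, respectively.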
So, today, when you run a FuzzyQuery in 4.0, it efficiently seeks and scans only those regions of the term space which may have matches, guided by the Levenshtein automaton. This, coupled with ongoing performance improvements to seeking and scanning terms, as well as other major improvements like performing MultiTermQuery rewrites per-segment, has given us the astounding overall gains in FuzzyQuery.
Thanks to these enormous performance improvements, Robert has created an entirely new automaton spell checker that uses this same algorithm to find candidate terms for respelling. This is just like FuzzyQuery, except it doesn't visit the matching documents. This is a big improvement over the existing spellchecker, as it does not require that a separate spellchecker index be maintained.
This whole exciting experience is a great example of why open-source development works so well. Here we have diverse committers from Lucene/Solr, bringing together their various unusual strengths (automatons, Unicode, Python, etc.) to bear on an insanely hard challenge, leveraging other potent open-source packages including Moman and Brics, iterating with the authors of these packages to resolve bugs. No single person involved in this really understands all of the parts; it's truly a team effort.
And now you know what's going on under the hood when you see incredible speedups with FuzzyQuery in 4.0!
[For the not-faint-of-heart, you can browse LUCENE-1606 to see parts of this story unfolding through Jira]
Friday, March 11, 2011
Hack on Lucene this summer!
Are you a student? Looking to do some fun coding this summer? Then join us for the 2011 Google Summer of Code!
The application deadline is in less than a month! Lucene has these initial potential projects identified, but you can also pick your own; just be sure to discuss with the community first (send an email to dev@lucene.apache.org).
Saturday, February 12, 2011
So many icicles
It's that cold time of year again -- that's right, winter! We have lots of snow, freezing temperatures, and... incredible icicles hanging off houses:

Conditions have to be just right for these icicles to form. First, you need lots of snow accumulated on roofs. Second, you need below-freezing temperatures for many days in a row. Finally, of course, you need a house that loses lots of heat through its attic/roof.

These icicles are an easy way to spot houses that waste heat, something that's otherwise not easy to detect! In the winter, here in New England, such houses stand out very clearly.

Of course, roof snow also melts for legitimate reasons. For example, when the temperature outside moves above freezing, the snow melts, but the resulting water simply falls off the roof. It's only when the ambient temperature is below freezing, yet the snow is still being melted by waste heat leaking through the roof, that you get immense icicles. The water dribbles down the roof, under the snow, and upon hitting the roof's edge / gutter overhang, which isn't warmed by that leaking heat, it freezes.
Given enough snow, wasted heat, below freezing temperatures, and time, you'll get amazing icicles. I can understand that older homes will have poorer insulation and thus waste heat: standards were more lax back then, and we generally were not as environmentally conscious as we are today. But, when I see massive icicles on new construction, it's disappointing:

These icicles are very firmly attached to the roof/gutter, since Mother Nature carefully froze them there one drop at a time. However, as pretty as they are, these so-called ice dams can do massive damage. There are all sorts of products and services out there to try to make them go away. But the best solution, of course, is to simply prevent them from developing in the first place, by addressing the root cause: stop wasting heat. You can add insulation to your attic and/or turn down the thermostat.

Friday, February 11, 2011
Visualizing Lucene's segment merges
If you've ever wondered how Lucene picks segments to merge during indexing, it looks something like this:
That video displays segment merges while indexing the entire Wikipedia (English) export (29 GB plain text), played back at ~8X real-time.
Each segment is a bar, whose height is the size (in MB) of the segment (log-scale). Segments on the left are largest; as new segments are flushed, they appear on the right. Segments being merged are colored the same color and, once the merge finishes, are removed and replaced with the new (larger) segment. You can see the nice logarithmic staircase pattern that merging creates.
By default, using ConcurrentMergeScheduler, Lucene executes each merge in a separate thread, allowing multiple merges to run at once without blocking ongoing indexing. The bigger the merge the longer it takes to finish.
One simple metric you can use to measure overall merge cost is to divide the total number of bytes read/written for all merging by the final byte size of an index; smaller values are better. This is analogous to the write amplification measure that solid-state disks use, in that your app has written X MB but because of merging and deleted documents overhead, Lucene had to internally read and write some multiple of X. You can think of this write amplification as a tax on your indexing; you don't pay this tax up front, when the document is first indexed, but only later as you continue to add documents to the index. The video shows the total size of the index as well as net bytes merged, so it's easy to compute write amplification for the above run: 6.19 (final index size was 10.87 GB and net bytes copied during merging was 67.30 GB).
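As a quick sanity check, here's that arithmetic in Python, using the numbers reported in the video:

# Write amplification = total bytes copied by merging / final index size.
net_bytes_merged_gb = 67.30
final_index_size_gb = 10.87
print(round(net_bytes_merged_gb / final_index_size_gb, 2))   # -> 6.19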
Proper merge selection is actually a tricky problem, in general, because we must carefully balance not burning CPU/IO (due to inefficient merge choices), while also not allowing too many segments to accumulate in the index, as this slows down search performance. To minimize merge cost, you ideally would merge only equal-sized segments, and merge a larger number of segments at a time.
In fact, from the viewpoint of the MergePolicy, this is really a game against a sneaky opponent who randomly makes sudden changes to the index, such as flushing new segments or applying new deletions. If the opponent is well behaved, it'll add equal-sized, large segments, which are easy to merge well, as was the case in the above video; but that's a really easy game, like playing tic-tac-toe against a 3-year-old.
Playing against this opponent is more like a game of chess:
No more nice looking staircase! This test shows the more challenging near-real-time use case, which calls updateDocument (= delete + add) at a high rate and frequently opens a new reader (creating a new segment each time). The dark gray band on top of each segment shows the proportion of deletions in that segment. When you delete a document in Lucene, the bytes consumed by that document are not reclaimed until the segment is merged, and you can see old segments being eroded as new segments are appended to the index. Unfortunately, Lucene's current default LogByteSizeMergePolicy struggles to pick good merges against this opponent, often merging irregularly sized segments.
The big issue with LogByteSizeMergePolicy is that it must pick adjacent segments for merging. However, we recently relaxed this longstanding limitation, and I'm working on a new merge policy, TieredMergePolicy (currently a patch on LUCENE-854) to take advantage of this. TieredMergePolicy also fixes some other limitations of LogByteSizeMergePolicy, such as merge cascading that results in occasionally "inadvertently optimizing" the index as well as the overly coarse control it offers over the maximum segment size.
TieredMergePolicy first computes the allowed "budget" of how many segments should be in the index, by counting how many steps the "perfect logarithmic staircase" would require given total index size, minimum segment size (floored), mergeAtOnce, and a new configuration maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be.
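Here's a rough Python sketch of how such a budget could be computed; the parameter names and the exact arithmetic are illustrative assumptions, not TieredMergePolicy's actual code.

def allowed_segment_count(total_index_bytes, floor_segment_bytes,
                          merge_at_once, max_segments_per_tier):
    # Count the segments in a "perfect logarithmic staircase": each tier
    # holds up to max_segments_per_tier segments, and each higher tier's
    # segments are merge_at_once times larger than the tier below it.
    allowed = 0.0
    remaining = total_index_bytes
    level_size = floor_segment_bytes
    while remaining > 0:
        segments_at_this_level = remaining / level_size
        if segments_at_this_level < max_segments_per_tier:
            allowed += segments_at_this_level
            break
        allowed += max_segments_per_tier
        remaining -= max_segments_per_tier * level_size
        level_size *= merge_at_once
    return round(allowed)

# E.g. a 10 GB index with a 2 MB floor, merging 10 at a time, 10 per tier:
print(allowed_segment_count(10 * 1024**3, 2 * 1024**2, 10, 10))   # -> 34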
Whenever the index is over-budget, it selects the best merge. Potential merges are scored with a combination of skew (basically how "equally sized" the segments being merged are), total size (smaller merges are favored), and how many deleted documents will be reclaimed. It also tries to merge to the exact maximum segment size (default 5GB).
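And here's a purely hypothetical scoring function in the same spirit (lower is better); the real formula differs, so treat the weights and exponents as made up for illustration.

def merge_score(candidate_sizes_mb, reclaimable_deletes_mb):
    total = sum(candidate_sizes_mb)
    skew = max(candidate_sizes_mb) / total                 # 1/N means perfectly balanced
    size_penalty = total ** 0.05                           # mildly favor smaller merges
    delete_bonus = 1.0 - reclaimable_deletes_mb / total    # favor reclaiming deletes
    return skew * size_penalty * delete_bonus

# A balanced merge of equal-sized segments scores better (lower) than a lopsided one:
print(merge_score([100, 100, 100, 100], 20))   # ~0.32
print(merge_score([1000, 10, 10, 10], 20))     # ~1.35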
Here's the same difficult near-real-time test, this time using TieredMergePolicy instead:
Note how the segments are now sorted by size, since TieredMergePolicy is allowed to merge non-adjacent segments. For the above difficult run, the write amplification for Lucene's current default merge policy (LogByteSizeMergePolicy) was 14.49 while the new merge policy (TieredMergePolicy) was 13.64, a nice improvement, though not as much as I was expecting. I suspect this is because TieredMergePolicy works hard to hit the max segment size (5 GB), resulting in 6 maximum sized segments while LogByteSizeMergePolicy had only 3. These numbers are much higher than the 6.19 write amplification from the "easy" merging, since that merging was about as efficient as we can hope for.
While TieredMergePolicy is a good improvement over LogByteSizeMergePolicy, it's still theoretically possible to do even better! In particular, TieredMergePolicy is greedy in its decision making: it only looks statically at the index, as it exists right now, and always chooses what looks like the best merge, not taking into account how this merge will affect future merges nor what further changes the opponent is likely to make to the index. This is good, but it's not guaranteed to produce the optimal merge sequence. For any series of changes made by the opponent, there is necessarily a corresponding perfect sequence of merges that minimizes net merge cost while obeying the budget. If instead the merge policy used a search with some lookahead, such as the Minimax algorithm, it could do a better job setting up more efficient future merges. I suspect this theoretical gain is likely small in practice; but if there are any game theorists out there reading this now, I'd love to be proven wrong!
I generated these movies with this simple Python script. It parses the infoStream output from IndexWriter, renders one frame at a time (each saved as a PNG file in the local file system using the Python Imaging Library), and finally encodes all frames into a video using MEncoder with the X264 codec.
Saturday, January 8, 2011
Finite State Transducers, Part 2
In my last post, I described a cool incremental algorithm for building an FST from pre-sorted input/output pairs, and how we're folding it into Lucene.
Progress! The patch on LUCENE-2792 is now committed to Lucene's trunk (eventually 4.0), so that we now use an FST to hold all terms in RAM for the SimpleText codec.
This was a great step forward, and it added the raw low-level infrastructure for building and using FSTs. But, it's really just a toy usage, since the SimpleText codec is not for production use.
I'm happy to report even more progress: we finally have a "real" usage for FSTs in Lucene! With LUCENE-2843 (now committed), we use an FST to hold the terms index in RAM.
Many operations in Lucene require looking up metadata for a requested term. This information includes the number of documents containing the term, file pointers into postings files that actually store the docIDs and positions, etc. Because there could be many unique terms, all this term metadata resides on-disk in the terms dictionary file (_X.tis).
Looking up a term is then a two-step process. First, we consult the terms index, which resides entirely in RAM, to map the requested term to a file pointer in the terms dictionary file. Second, we scan the terms in the on-disk terms dictionary file starting from that file pointer, to look for an exact match.
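Here's a toy Python sketch of that two-step lookup; the 128-term interval and the in-memory lists are simplifying assumptions, standing in for Lucene's real on-disk terms dictionary.

import bisect

INDEX_INTERVAL = 128   # keep every 128th term in the RAM-resident terms index

def build_terms_index(sorted_terms):
    # (term, position-in-dictionary) pairs for every 128th term.
    return [(t, i) for i, t in enumerate(sorted_terms) if i % INDEX_INTERVAL == 0]

def lookup(term, terms_index, sorted_terms):
    keys = [t for t, _ in terms_index]
    slot = bisect.bisect_right(keys, term) - 1     # step 1: consult the RAM index
    if slot < 0:
        return None                                # sorts before every indexed term
    start = terms_index[slot][1]
    stop = min(start + INDEX_INTERVAL, len(sorted_terms))
    for i in range(start, stop):                   # step 2: scan the dictionary
        if sorted_terms[i] == term:
            return i                               # found; here we'd read its metadata
        if sorted_terms[i] > term:
            break
    return None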
Certain queries, such as TermQuery, need only one exact term lookup. Others, such as FuzzyQuery or the new, very cool automaton spellchecker, which does not require a separate spellchecker index, perform many term lookups (potentially millions).
Before LUCENE-2843, in trunk, the terms index used packed byte[] and int[] arrays. This was already a huge improvement over the terms index in 3.0, but it was still more wasteful than an FST, since each term was stored in fully expanded form while the FST shares common prefixes and suffixes. Getting this working also required adding separate seekFloor and seekCeil methods to the FSTEnum classes (like a SortedMap<T>).
To test this, I indexed the first 10 million 1KB documents derived from Wikipedia's English database download. The resulting RAM required for the FST was ~38% - 52% smaller (larger segments see more gains, as the FST "scales up" well). Not only is the RAM required much lower, but term lookups are also faster: the FuzzyQuery united~2 was ~22% faster.
This is very much win/win, so we've made this terms index the default for all core codecs (the terms dictionary and terms index are pluggable, so it's easy for other codecs to use this as well).
There are many other ways we can use FSTs in Lucene, and we've only just scratched the surface here. In fact, FSTs offer such a sizable RAM reduction that I think for many, but not all, apps it'd be realistic to avoid the two-step term lookup process entirely and simply hold the entire terms dictionary in RAM. This should make term-lookup-intensive queries potentially much faster, though we'd likely have to rework them to use different algorithms optimized for iterating directly through the terms as an FST.
Friday, December 3, 2010
Using Finite State Transducers in Lucene
FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output. They also look cool:

That FST maps the sorted words mop, moth, pop, star, stop and top to their ordinal number (0, 1, 2, ...). As you traverse the arcs, you sum up the outputs, so stop hits 3 on the s and 1 on the o, so its output ordinal is 4. The outputs can be arbitrary numbers or byte sequences, or combinations, etc. -- it's pluggable.
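Here's a tiny Python illustration of that traversal, with the arcs and outputs hardcoded from the example above (a real Lucene FST shares suffixes and packs all of this far more compactly, so treat the state numbering as made up):

# (state, label) -> (next_state, output); a word's ordinal is the sum of the
# outputs along its path.
arcs = {
    (0, "m"): (1, 0), (1, "o"): (2, 0), (2, "p"): ("end", 0),
    (2, "t"): (3, 1), (3, "h"): ("end", 0),
    (0, "p"): (4, 2), (4, "o"): (5, 0), (5, "p"): ("end", 0),
    (0, "s"): (6, 3), (6, "t"): (7, 0), (7, "a"): (8, 0), (8, "r"): ("end", 0),
    (7, "o"): (9, 1), (9, "p"): ("end", 0),
    (0, "t"): (10, 5), (10, "o"): (11, 0), (11, "p"): ("end", 0),
}

def ordinal(word):
    state, total = 0, 0
    for ch in word:
        state, output = arcs[(state, ch)]
        total += output
    return total

for w in ["mop", "moth", "pop", "star", "stop", "top"]:
    print(w, ordinal(w))   # 0, 1, 2, 3, 4, 5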
Essentially, an FST is a SortedMap<ByteSequence,SomeOutput>, if the arcs are in sorted order. With the right representation, it requires far less RAM than other SortedMap implementations, but has a higher CPU cost during lookup. The low memory footprint is vital for Lucene since an index can easily have many millions (sometimes, billions!) of unique terms.
There's a great deal of theory behind FSTs. They generally support the same operations as FSMs (determinize, minimize, union, intersect, etc.). You can also compose them, where the outputs of one FST are intersected with the inputs of the next, resulting in a new FST.
There are some nice general-purpose FST toolkits (OpenFst looks great) that support all these operations, but for Lucene I decided to implement this neat algorithm which incrementally builds up the minimal unweighted FST from pre-sorted inputs. This is a perfect fit for Lucene since we already store all our terms in sorted (unicode) order.
The resulting implementation (currently a patch on LUCENE-2792) is fast and memory efficient: it builds the 9.8 million terms in a 10 million document Wikipedia index in ~8 seconds (on a fast computer), requiring less than 256 MB heap. The resulting FST is 69 MB. It can also build a prefix trie, pruning by how many terms come through each node, with even less memory.
Note that because addition is commutative, an FST with numeric outputs is not guaranteed to be minimal in my implementation; perhaps if I could generalize the algorithm to a weighted FST instead, which also stores a weight on each arc, that would yield the minimal FST. But I don't expect this will be a problem in practice for Lucene.
In the patch I modified the SimpleText codec, which was loading all terms into a TreeMap mapping the BytesRef term to an int docFreq and long filePointer, to use an FST instead, and all tests pass!
There are lots of other potential places in Lucene where we could use FSTs, since we often need to map the index terms to "something". For example, the terms index maps to a long file position; the field cache maps to ordinals; the terms dictionary maps to codec-specific metadata, etc. We also have multi-term queries (e.g., Prefix, Wildcard, Fuzzy, Regexp) that need to test a large number of terms and could instead work directly via intersection with the FST (many apps could easily fit their entire terms dict in RAM as an FST since the format is so compact). The FST could be used for a key/value store. Lots of fun things to try!
Many thanks to Dawid Weiss for helping me iterate on this.

Wednesday, November 3, 2010
Big Fat Fiasco
I just watched this talk by Tom Naughton, describing the mis-steps and bad science over the years that have tricked us all into believing we should avoid fat and cholesterol in our diet when in fact we should be doing the exact opposite!
Tom is an excellent speaker (he's a comedian!), mixing humor into what is at heart a very sad topic. He details the science that should have led us to conclude that excessive carb and sugar consumption is in fact the root cause behind heart disease, diabetes and obesity, not fat and cholesterol as we've been incessantly told over the years.
He shows how the "Fat Lazy Slob Theory" (also called "calories in minus calories out") so frequently put forth to explain weight gain is in fact wrong, and that instead the root cause is the biochemistry in your body driving you to eat when your blood sugar is too high.
Tom's documentary movie Fat Head, which is getting great reviews on Amazon, delves into these same topics. I haven't watched it yet but I plan to.
So enjoy your butter, cheese, whole milk, eggs (with yolk!!) and cut back on sugars and carbs. And don't drink soda!
Monday, October 25, 2010
Our medical system is a house of cards
I just came across this great article about meta-researcher Dr. John Ioannidis. Here's the summary:
Much of what medical researchers conclude in their studies is misleading, exaggerated, or flat-out wrong. So why are doctors—to a striking extent—still drawing upon misinformation in their everyday practice? Dr. John Ioannidis has spent his career challenging his peers by exposing their bad science.
The gist is that modern medical research is deeply flawed and biased such that the "conclusions" that you and I eventually read in the news headlines are often false. I especially love his advice for us all:
Ioannidis suggests a simple approach: ignore them all
This is in fact my approach! I have a simple rule: if it tastes good it's good for you. So I eat plenty of fat, salt, sugar, cholesterol, carbs, etc. I love eggs and cheese and I always avoid low-fat or low-cholesterol foods. I get lots of sun and never use sun screen. I drink coffee and beer, daily. I drink lots of water. I get daily exercise, running and walking. And I avoid hand sanitizers like Purell (I believe commonplace dirt/germs are in fact natural and good for you). I strongly believe humans do not need pills to stay healthy. I don't take a daily vitamin. And I'm very healthy!
This short interview between Discover Magazine and Harvard clinician John Abramson echoes the same core problem. Here's a choice quote:
When you look at the highest quality medical studies, the odds that a study will favor the use of a new drug are 5.3 times higher for commercially funded studies than for noncommercially funded studies.
Unfortunately, the medical world has a deep, deep conflict of interest: healthy people do not generate profits. Capitalism is a horrible match to health care.
So, next time your doctor prescribes a fancy new cool-sounding powerful drug like Alevia or Omosia or Nanotomopia or whatever, try to remember that our medical system is really built on a house of cards. Your doctor, let alone you, cannot possibly differentiate what's true from what's false. Don't trust that large triple-blind random controlled trial that supposedly validated this cool new drug. You are the guinea pig! And it's only when these drugs cause all sorts of problems once they are really tested on the population at large that their true colors are revealed.
Sunday, October 17, 2010
Pics from BBQ after Lucene Revolution
I finally pulled the pics off my camera from last week's BBQ after Lucene Revolution in Boston, where much fun was had! See them here. It was awesome to finally meet everyone!
Saturday, October 9, 2010
Fun with flexible indexing
The Lucene Revolution conference just wrapped up yesterday. It was well attended (~300 people). It was great fun to hear about all the diverse ways that Lucene and Solr are being used in the real world.
I gave a talk about flexible indexing, coming in the next major release of Lucene (4.0). Slides are here.
Tuesday, October 5, 2010
Lucene's SimpleText codec
Inspired by this question on the Lucene user's list, I created a new codec in Lucene called the SimpleText codec. The best ideas come from the user's lists!
This is of course only available in Lucene's current trunk, to be eventually released as the next major release (4.0). Flexible indexing makes it easy to swap in different codecs to do the actual writing and reading of postings data to/from the index, and we have several fun codecs already available and more on the way...
Unlike all other codecs, which save the postings data in compact binary files, this codec writes all postings to a single human-readable text file, like this:
field contents
  term file
    doc 0
      pos 5
  term is
    doc 0
      pos 1
  term second
    doc 0
      pos 3
  term test
    doc 0
      pos 4
  term the
    doc 0
      pos 2
  term this
    doc 0
      pos 0
END
The codec is read/write, and fully functional. All of Lucene's unit tests pass (slowly) with this codec (which, by the way, is an awesome way to test your own codecs).
Note that the performance of SimpleText is quite poor, as expected! For example, there is no terms index for fast seeking to a specific term, no skipping data for fast seeking within a posting list, some operations require linear scanning, etc. So don't use this one in production!
But it should be useful for transparency, debugging, learning, teaching or anyone who is simply just curious about what exactly Lucene stores in its inverted index.
Tuesday, September 28, 2010
Track your home's live electricity usage with Python
Modern electric meters (the one attached to the outside of your house) emit an IR flash for each watt-hour consumed, through the port on the top of the meter. It turns out, it's easy to detect this flash, decode it into "live" power usage, and make cool charts like this:

That's live kW on the Y axis and time on the X axis.
The flash, at least for my meter, appears to have high temporal accuracy, meaning it flashes precisely as you cross 1.000 Wh. This is great, since it enables accurate, real-time measurement of power usage. For example, when I turn on the lights in my home office, I quickly see the power jump by ~65 watts, and then drop again when I turn the lights off.
This is a fun way to track down which appliances or crazy computers or parasitic drains are using up so much electricity in your house!
I've gone through several designs for this over the years. My last attempt was destroyed when lightning hit my house (there's always a silver lining!), so, I decided to try something new this time and I'm very happy with the approach since it's much simpler, and, moves the pulse detection into Python.
I use a trivial analog circuit to detect the IR pulse, consisting of two 100K resistors in series and an IR photodiode in parallel with one of the resistors. Make sure you get the polarity right on the photodiode otherwise it won't work! Mount the photodiode on your power meter, aligned so that it "sees" each IR pulse. I also added a small (0.01 uF) ceramic capacitor in parallel with the same resistor, to suppress transient EM fields otherwise picked up by the relatively long wires out to my meter.
Finally, I use this simple USB audio adapter to move the problem into the digital domain, connecting the ends of the two series resistors to the mic input and using it for the A/D conversion. This USB audio adapter drives the mic input with ~4.0V bias voltage, which is great since otherwise you'd need an external source.
When there's no pulse, the photodiode acts roughly like an open switch, meaning there is a fixed 200K resistive load on the mic input. When a pulse happens (mine seem to last for ~10 msec), the photodiode acts roughly like a closed switch, suddenly dropping the series resistance to 100K. This drop causes a very detectable pulse (strongly negative then strongly positive) on the mic input's voltage, I suspect because there's a capacitor behind the bias voltage (but: I am no analog circuit engineer, so this is speculation!).
You can plug this USB audio adapter into any computer; I use a Sheeva plug computer (delightful device, very low power -- I have three!). I record the digital samples (arecord works well, at a 2 kHz rate) and decode the samples in Python to detect a pulse whenever the value drops below -1000. You can easily compute the live kW based on the time between two adjacent pulses, push this into a database, and build graphs on top of this using Google's visualization APIs (I use dygraphs).
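Here's a minimal Python sketch of that decoding step, assuming the recorded samples arrive as a sequence of signed integers; the 2 kHz rate, the -1000 threshold and the one-flash-per-watt-hour behavior come from this post, while the names are just illustrative.

SAMPLE_RATE_HZ = 2000    # arecord capture rate
THRESHOLD = -1000        # a pulse drives the mic signal strongly negative
WH_PER_PULSE = 1.0       # one IR flash per watt-hour consumed

def live_kw(samples):
    # Yields the average kW over each interval between two adjacent pulses.
    last_pulse = None
    in_pulse = False
    for i, value in enumerate(samples):
        if value < THRESHOLD and not in_pulse:
            in_pulse = True                          # falling edge: a new pulse
            if last_pulse is not None:
                seconds = (i - last_pulse) / SAMPLE_RATE_HZ
                yield WH_PER_PULSE * 3.6 / seconds   # 1 Wh over t seconds = 3600/t watts
            last_pulse = i
        elif value >= THRESHOLD:
            in_pulse = False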
My last approach didn't have nearly the temporal accuracy (ie, it smoothed heavily across time), which masks interesting things. For example, now I can tell the difference between a resistive load (the coffee maker, oven, crockpot) and an inductive load (refrigerator compressor, vacuum cleaner) because the inductive load has a huge spike in the beginning, as the motor consumes lots of power trying to spin up.
