Saturday, January 8, 2011

Finite State Transducers, Part 2

In my last post, I described a cool incremental algorithm for building an FST from pre-sorted input/output pairs, and how we're folding it into Lucene.

Progress! The patch on LUCENE-2792 is now committed to Lucene's trunk (eventually 4.0), so that we now use an FST to hold all terms in RAM for the SimpleText codec.

This was a great step forward, and it added the raw low-level infrastructure for building and using FSTs. But it's really just a toy usage, since the SimpleText codec is not for production use.

I'm happy to report even more progress: we finally have a "real" usage for FSTs in Lucene! With LUCENE-2843 (now committed), we use an FST to hold the terms index in RAM.

Many operations in Lucene require looking up metadata for a requested term. This information includes the number of documents containing the term, file pointers into postings files that actually store the docIDs and positions, etc. Because there could be many unique terms, all this term metadata resides on-disk in the terms dictionary file (_X.tis).

Looking up a term is then a two-step process. First, we consult the terms index, which resides entirely in RAM, to map the requested term to a file pointer in the terms dictionary file. Second, we scan the terms in the on-disk terms dictionary file starting from that file pointer, to look for an exact match.
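To make the two-step lookup concrete, here's a minimal sketch of an exact term lookup through the public TermsEnum API. The method names follow the later 4.x API (trunk's API was still in flux at the time), and the field "body" and term "united" are just examples:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    public class TermLookupSketch {
      // Looks up one term and prints its document frequency. Under the
      // hood, seekExact first consults the in-RAM terms index to find a
      // file pointer, then scans the on-disk terms dictionary from there.
      public static void printDocFreq(IndexReader reader) throws IOException {
        Terms terms = MultiFields.getTerms(reader, "body");
        if (terms == null) {
          return;
        }
        TermsEnum termsEnum = terms.iterator(null);
        if (termsEnum.seekExact(new BytesRef("united"))) {
          // Term metadata, such as document frequency, is now available:
          System.out.println("docFreq=" + termsEnum.docFreq());
        }
      }
    }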

Certain queries, such as TermQuery, need only one exact term lookup. Others, such as FuzzyQuery or the new, very cool automaton spellchecker, which does not require a separate spellchecker index, perform many term lookups (potentially millions).
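For example, here's the programmatic form of united~2, using the 4.x FuzzyQuery constructor that takes a maximum edit distance (the field name here is made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Query;

    public class FuzzySketch {
      // The query-parser syntax united~2: match all terms in the "body"
      // field within 2 edits of "united". Each candidate term the query
      // visits means another lookup through the terms index.
      public static Query unitedWithinTwoEdits() {
        return new FuzzyQuery(new Term("body", "united"), 2);
      }
    }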

Before LUCENE-2843, the terms index in trunk used packed byte[] and int[] arrays. That was already a huge improvement over the terms index in 3.0, but it was still more wasteful than an FST, since each term was stored in fully expanded form, while the FST shares common prefixes and suffixes. Getting this working also required adding separate seekFloor and seekCeil methods to the FSTEnum classes (analogous to the floorKey and ceilingKey methods on Java's NavigableMap).
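For illustration, here's a rough sketch of seekFloor through BytesRefFSTEnum, building a tiny FST that maps terms to longs (think file pointers). The terms and outputs are invented, and the exact org.apache.lucene.util.fst signatures changed across releases, so treat this as a sketch of the later 4.x API:

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRefBuilder;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.BytesRefFSTEnum;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    public class FSTEnumSketch {
      public static void main(String[] args) throws Exception {
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
        IntsRefBuilder scratch = new IntsRefBuilder();
        // Input terms must be added in sorted order:
        builder.add(Util.toIntsRef(new BytesRef("cat"), scratch), 5L);
        builder.add(Util.toIntsRef(new BytesRef("dog"), scratch), 10L);
        builder.add(Util.toIntsRef(new BytesRef("dogs"), scratch), 17L);
        FST<Long> fst = builder.finish();

        BytesRefFSTEnum<Long> fstEnum = new BytesRefFSTEnum<>(fst);
        // seekFloor finds the largest term <= the target; for "dogfish"
        // that's "dog", since "dogfish" sorts before "dogs":
        BytesRefFSTEnum.InputOutput<Long> floor =
            fstEnum.seekFloor(new BytesRef("dogfish"));
        System.out.println(floor.input.utf8ToString() + " -> " + floor.output);
      }
    }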

To test this, I indexed the first 10 million 1KB documents derived from Wikipedia's English database download. The RAM required for the FST-based terms index was ~38-52% lower than for the packed arrays (larger segments see more gains, as the FST "scales up" well). Not only is the RAM required much lower, but term lookups are also faster: the FuzzyQuery united~2 was ~22% faster.

This is very much win/win, so we've made this terms index the default for all core codecs (the terms dictionary and terms index are pluggable, so it's easy for other codecs to use this as well).

There are many other ways we can use FSTs in Lucene, and we've only just scratched the surface here. In fact, FSTs offer such a sizable RAM reduction that I think for many, but not all, apps it'd be realistic to avoid the two-step term lookup process entirely and simply hold the entire terms dictionary in RAM. This should make term-lookup-intensive queries potentially much faster, though we'd likely have to rework them to use different algorithms optimized for iterating directly through the terms as an FST.