Changing Bits: Using Finite State Transducers in Lucene

Friday, December 3, 2010

FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output. They also look cool:

That FST maps the sorted words mop, moth, pop, star, stop and top to their ordinal number (0, 1, 2, ...). As you traverse the arcs, you sum up the outputs, so stop hits 3 on the s and 1 on the o, so its output ordinal is 4. The outputs can be arbitrary numbers or byte sequences, or combinations, etc. -- it's pluggable.
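To make the arc-summing concrete, here is a minimal Python sketch of a transducer for the six words above. The state numbering, the `ARCS` table and the `lookup` helper are hypothetical names made up for illustration; the layout is one assignment of outputs consistent with the description (3 on the s, 1 on the o), not necessarily the exact figure, and it is not Lucene's API.

```python
# Hypothetical hand-built transducer; state ids are arbitrary.
# Each arc maps a label to (output, next_state).
FINAL = "F"
ARCS = {
    0: {"m": (0, 1), "p": (2, 5), "s": (3, 6), "t": (5, 5)},
    1: {"o": (0, 2)},
    2: {"p": (0, FINAL), "t": (1, 3)},
    3: {"h": (0, FINAL)},
    4: {"p": (0, FINAL)},
    5: {"o": (0, 4)},
    6: {"t": (0, 7)},
    7: {"a": (0, 8), "o": (1, 4)},
    8: {"r": (0, FINAL)},
}


def lookup(word):
    """Return the summed output for word, or None if it is not accepted."""
    state, total = 0, 0
    for ch in word:
        arc = ARCS.get(state, {}).get(ch)
        if arc is None:
            return None          # no arc with this label: word not in the FST
        output, state = arc
        total += output          # outputs accumulate along the path
    return total if state == FINAL else None


words = ["mop", "moth", "pop", "star", "stop", "top"]
print([lookup(w) for w in words])  # [0, 1, 2, 3, 4, 5]
```

For stop the path is s (3), t (0), o (1), p (0), which sums to 4. A prefix such as mo ends in a non-final state, so `lookup` returns `None`.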

Essentially, an FST is a SortedMap<ByteSequence,SomeOutput>, if the arcs are in sorted order. With the right representation, it requires far less RAM than other SortedMap implementations, but has a higher CPU cost during lookup. The low memory footprint is vital for Lucene since an index can easily have many millions (sometimes, billions!) of unique terms.

There's a great deal of theory behind FSTs. They generally support the same operations as FSMs (determinize, minimize, union, intersect, etc.). You can also compose them, where the outputs of one FST are intersected with the inputs of the next, resulting in a new FST.

There are some nice general-purpose FST toolkits (OpenFst looks great) that support all these operations, but for Lucene I decided to implement this neat algorithm, which incrementally builds up the minimal unweighted FST from pre-sorted inputs. This is a perfect fit for Lucene since we already store all our terms in sorted (Unicode) order.
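The idea of building incrementally from sorted input can be sketched in a simplified form. The Python below is not the algorithm used in the patch: it drops the outputs entirely and builds only a minimal acyclic automaton (an acceptor), in the style of the incremental construction described by Daciuk et al. All names (`State`, `build`, `replace_or_register`) are made up for this sketch. The key point it shows is that, because input is sorted, everything below the point where the new word diverges from the previous one is finished and can be merged with an equivalent existing state right away.

```python
class State:
    def __init__(self):
        self.arcs = {}       # label -> State; labels arrive in sorted order
        self.final = False

    def key(self):
        # States are equivalent if they agree on finality and on their
        # (label, target) pairs; targets are already canonical when called.
        return (self.final, tuple((c, id(t)) for c, t in self.arcs.items()))


def replace_or_register(state, register):
    """Freeze the most recently added child of state, sharing duplicates."""
    label = max(state.arcs)              # sorted input: last arc has max label
    child = state.arcs[label]
    if child.arcs:
        replace_or_register(child, register)
    # Reuse an equivalent registered state, or register this one.
    state.arcs[label] = register.setdefault(child.key(), child)


def build(sorted_words):
    root, register, prev = State(), {}, None
    for word in sorted_words:
        if prev is not None and word <= prev:
            raise ValueError("input must be sorted and unique")
        # Walk the prefix shared with the previous word.
        node, i = root, 0
        while prev is not None and i < min(len(word), len(prev)) and word[i] == prev[i]:
            node = node.arcs[word[i]]
            i += 1
        # The rest of the previous word can no longer change: minimize it.
        if node.arcs:
            replace_or_register(node, register)
        # Append the new suffix as fresh states.
        for ch in word[i:]:
            node.arcs[ch] = State()
            node = node.arcs[ch]
        node.final = True
        prev = word
    if root.arcs:
        replace_or_register(root, register)
    return root


def accepts(root, word):
    state = root
    for ch in word:
        state = state.arcs.get(ch)
        if state is None:
            return False
    return state.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        s = stack.pop()
        if id(s) not in seen:
            seen.add(id(s))
            stack.extend(s.arcs.values())
    return len(seen)


root = build(["mop", "moth", "pop", "star", "stop", "top"])
print(count_states(root))  # 10
```

A plain trie over the same six words needs 18 nodes; sharing suffixes brings that down to 10 states here. For instance, the p and t arcs out of the root end up pointing at the same state, since both are followed only by op.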

The resulting implementation (currently a patch on LUCENE-2792) is fast and memory efficient: it builds the 9.8 million terms in a 10 million document Wikipedia index in ~8 seconds (on a fast computer), requiring less than 256 MB of heap. The resulting FST is 69 MB. It can also build a prefix trie, pruning by how many terms come through each node, with even less memory.

Note that because addition is commutative, an FST with numeric outputs is not guaranteed to be minimal in my implementation; perhaps if I could generalize the algorithm to a weighted FST instead, which also stores a weight on each arc, that would yield the minimal FST. But I don't expect this will be a problem in practice for Lucene.

In the patch I modified the SimpleText codec, which was loading all terms into a TreeMap mapping the BytesRef term to an int docFreq and long filePointer, to use an FST instead, and all tests pass!

There are lots of other potential places in Lucene where we could use FSTs, since we often need to map the index terms to "something". For example, the terms index maps to a long file position; the field cache maps to ordinals; the terms dictionary maps to codec-specific metadata, etc. We also have multi-term queries (e.g. Prefix, Wildcard, Fuzzy, Regexp), which need to test a large number of terms and could instead work directly via intersection with the FST (many apps could easily fit their entire terms dict in RAM as an FST since the format is so compact). The FST could also be used for a key/value store. Lots of fun things to try!
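As a rough illustration of the simplest multi-term case, a prefix query can be answered by following the prefix's arcs once and then enumerating everything reachable from that state, without testing each term individually. The sketch below is in Python over a hypothetical hand-built arc table for the six example words (`ARCS`, `FINAL` and `terms_with_prefix` are names invented here, not Lucene's API); Wildcard, Fuzzy and Regexp queries would need a real automaton intersection, which this does not attempt.

```python
# Hypothetical transducer for mop, moth, pop, star, stop, top.
# Each arc maps a label to (output, next_state); "F" is the final state.
FINAL = "F"
ARCS = {
    0: {"m": (0, 1), "p": (2, 5), "s": (3, 6), "t": (5, 5)},
    1: {"o": (0, 2)},
    2: {"p": (0, FINAL), "t": (1, 3)},
    3: {"h": (0, FINAL)},
    4: {"p": (0, FINAL)},
    5: {"o": (0, 4)},
    6: {"t": (0, 7)},
    7: {"a": (0, 8), "o": (1, 4)},
    8: {"r": (0, FINAL)},
}


def terms_with_prefix(prefix):
    """Yield (term, output) for every accepted term starting with prefix."""
    state, total = 0, 0
    for ch in prefix:                      # consume the prefix once
        arc = ARCS.get(state, {}).get(ch)
        if arc is None:
            return                         # no term has this prefix
        output, state = arc
        total += output
    stack = [(state, prefix, total)]
    while stack:                           # depth-first walk of the remainder
        state, term, total = stack.pop()
        if state == FINAL:
            yield term, total
        # Push in reverse label order so terms come out sorted.
        for label in sorted(ARCS.get(state, {}), reverse=True):
            output, nxt = ARCS[state][label]
            stack.append((nxt, term + label, total + output))


print(list(terms_with_prefix("st")))  # [('star', 3), ('stop', 4)]
```

The work done is proportional to the prefix length plus the size of the matching sub-graph, not to the total number of terms.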

Many thanks to Dawid Weiss for helping me iterate on this.

23 comments:

Michael Kleen, May 24, 2013 at 10:53 PM
Hello, what does the 69 MB FST for Wikipedia contain? Only the titles of the entries, or the whole text as well?
Michael McCandless, May 25, 2013 at 10:51 AM
Hi Michael,

That FST held all unique terms (tokens) from indexing all English Wikipedia content.
kbros, August 1, 2013 at 8:08 PM
Hi Mike,
Since the FST has a higher CPU cost because of seeking, how badly is its performance affected as the number of terms multiplies?

Do you have an estimate of how a typical query's time is distributed among the FST, seeking in the terms dictionary, and calculating frequencies or positions? How can I profile that on my own index?

Thanks!
Unknown, October 3, 2013 at 12:56 PM
Thank you Michael for the quick answer.
For the intermediate automaton, are you referring to the Levenshtein automaton, which accepts all strings within the expected distance of the query term?
So, directly building an automaton that accepts only index terms within the expected distance of the query term?
Unknown, October 4, 2013 at 7:15 AM
Yes, it makes sense and sounds reasonable :)
But is it feasible?
I am just starting to learn about FSAs and FSTs, so I'm not yet into the implementation,
but one thing I remember is that the Lucene FST implementation is so good and compact because it creates an immutable FST (which becomes a sort of byte array under the hood).
So is creating an expandable FST already possible, or would it be a challenge?
Can you suggest other material to study regarding this really interesting topic? (I have followed your Solr Revolution conference, some blog posts and videos so far :) )
Anonymous, March 7, 2014 at 4:01 AM
Hey Mike,
Are we using the FST to build the inverted index? If so, what does the output of traversing a string in the FST point to?
Anonymous, September 14, 2014 at 9:54 AM
I was wondering whether it is possible to associate ids with the strings recognized by the automaton. Say, for instance, I have a lexicon and each of its entries has a unique integer id. I'd like to compile the lexicon as a union string automaton so that whenever it recognizes an entry in a string, it outputs the unique id. It's like using the automaton as a set of primary keys.
I could not find an example of such a use online.
In a previous comment you mention MemoryPostingsFormat being able to output ids (though document ids); would it be possible to use that? Thanks
Anonymous, February 11, 2015 at 5:12 AM
That's astonishing...
Unknown, March 20, 2017 at 3:06 PM
Hi Michael,
I have a question regarding prefix queries: how expensive are they?
We are using Solr in our organization. The majority of our queries are becoming prefix queries, and I am worried that this will affect performance, since at that point it is not really searching but more like matching.

What do you suggest? If the majority of queries are prefix queries, should they be handled at index time using something like EdgeNgram instead of doing all this at query time?

Just to give you an idea, our index might have 500 million terms.
Pussystroker, July 7, 2018 at 5:20 AM
Hi Michael,
I have three newbie questions about the (your) FST:
1) "As you traverse the arcs, you sum up the outputs, so stop hits 3 on the s and 1 on the o, so its output ordinal is 4."

For a given word (in your discussion, the word "stop"), are the output values assigned to arbitrary transitions ("arcs")? Could you instead have assigned: "stop hits 3 on the t and 1 on the o, so its output ordinal is 4"?

2) From the start state, why do the transitions "p/2" and "t/5" go to the same state?

3) For simplicity, let's ignore stemming for a sec and suppose we index a new word, "popstar". Do we simply introduce an arc going from the end state back to the state that "s/3" points to?
Unknown, September 3, 2018 at 9:14 PM
Hi Michael,
Is it possible for the FST to take regular expressions, so that any sequence matching a given regular expression is transformed to an output? Could you please give me a few pointers? Thank you!