Sunday, June 9, 2013

Build your own finite state transducer

Have you always wanted your very own Lucene finite state transducer (FST) but you couldn't figure out how to use Lucene's crazy APIs?

Then today is your lucky day! I just built a simple web application that creates an FST from the input/output strings that you enter.

If you just want a finite state automaton (no outputs) then enter only inputs, such as this example:



If all of your outputs are non-negative integers then the FST will use numeric outputs, where you sum up the outputs as you traverse a path to get the final output:

Finally, if the outputs are non-numeric then they are treated as strings, in which case you concatenate as you traverse the path:

The red arcs are the ones with the NEXT optimization: these arcs do not store a pointer to a node because their to-node is the very next node in the FST. This is a good optimization: it generally results in a large reduction of the FST size. The bolded arcs tell you the next node is final; this is most interesting when a prefix of another input is accepted, such as this example:



Here the "r" arc is bolded, telling you that "star" is accepted. Furthermore, that node following the "r" arc has a final output, telling you the overall output for "star" is "abc".

The web app is a simple Python WSGI app; source code is here. It invokes a simple Java tool as a subprocess; source code (including generics violations!) is here.

8 comments:

  1. Hello Mr McCandless,

    Nice blog! Is there an email address I can contact you in private?

    Thanks,

    Eleftheria Kiourtzoglou


    Head of Editorial Team

    Java Code Geeks
    email: eleftheria[dot]kiourtzoglou[at]javacodegeeks[dot]com

    ReplyDelete
  2. is it possible to optimize the chains "cde" and "xcde" of the entries "abxcdey abicdej" into one "cde"?
    look: http://examples.mikemccandless.com/fst.py?terms=abxcdey%0D%0Aabicdej&cmd=Build+it%21

    ReplyDelete
    Replies
    1. No, FSTs cannot do that, because those two "cde" carry different state since they complete to different suffixes.

      If they both completed to the same set of suffixes then "cde" is shared, e.g.:

      http://examples.mikemccandless.com/fst.py?terms=abxcdey%0D%0Aabicdej%0D%0Aabxcdej%0D%0Aabicdey&cmd=Build+it%21

      Delete
  3. Excellent article. Any thoughts how this can be used for key/value store kind of application

    ReplyDelete
    Replies
    1. FSTs can easily store key/value pairs, since they really act like a SortedMap, however this is only "useful" if 1) you can fit the entire map into RAM, and 2) the keys have commonalities to them (e.g. words from natural language) in which case they compress well.

      Delete
  4. Hello Michael, how are you?

    Hope you can help me. In the Lucene API have the ability to change a document, removes and adds the document. I have the need to add/remove a term by docId/field. There is the possibility to perform the link between a term with a field and its existing document? (field -> terms -> term -> DocIds)

    writer.removeTerm (docId, field, term);
    writer.addTerm (docId, field, term);

    thank you,
    Marcio

    ReplyDelete
    Replies
    1. Hi Marcio,

      Can you send this question to java-user@lucene.apache.org instead?

      Thanks.

      Delete