Sunday, June 9, 2013

Build your own finite state transducer

Have you always wanted your very own Lucene finite state transducer (FST) but you couldn't figure out how to use Lucene's crazy APIs?

Then today is your lucky day! I just built a simple web application that creates an FST from the input/output strings that you enter.

If you just want a finite state automaton (no outputs) then enter only inputs, such as this example:



If all of your outputs are non-negative integers then the FST will use numeric outputs, where you sum up the outputs as you traverse a path to get the final output:

Finally, if the outputs are non-numeric then they are treated as strings, in which case you concatenate as you traverse the path:

The red arcs are the ones with the NEXT optimization: these arcs do not store a pointer to a node because their to-node is the very next node in the FST. This is a good optimization: it generally results in a large reduction of the FST size. The bolded arcs tell you the next node is final; this is most interesting when a prefix of another input is accepted, such as this example:



Here the "r" arc is bolded, telling you that "star" is accepted. Furthermore, that node following the "r" arc has a final output, telling you the overall output for "star" is "abc".

The web app is a simple Python WSGI app; source code is here. It invokes a simple Java tool as a subprocess; source code (including generics violations!) is here.

16 comments:

  1. Hello Mr McCandless,

    Nice blog! Is there an email address I can contact you in private?

    Thanks,

    Eleftheria Kiourtzoglou


    Head of Editorial Team

    Java Code Geeks
    email: eleftheria[dot]kiourtzoglou[at]javacodegeeks[dot]com

    ReplyDelete
  2. is it possible to optimize the chains "cde" and "xcde" of the entries "abxcdey abicdej" into one "cde"?
    look: http://examples.mikemccandless.com/fst.py?terms=abxcdey%0D%0Aabicdej&cmd=Build+it%21

    ReplyDelete
    Replies
    1. No, FSTs cannot do that, because those two "cde" carry different state since they complete to different suffixes.

      If they both completed to the same set of suffixes then "cde" is shared, e.g.:

      http://examples.mikemccandless.com/fst.py?terms=abxcdey%0D%0Aabicdej%0D%0Aabxcdej%0D%0Aabicdey&cmd=Build+it%21

      Delete
  3. Excellent article. Any thoughts how this can be used for key/value store kind of application

    ReplyDelete
    Replies
    1. FSTs can easily store key/value pairs, since they really act like a SortedMap, however this is only "useful" if 1) you can fit the entire map into RAM, and 2) the keys have commonalities to them (e.g. words from natural language) in which case they compress well.

      Delete
  4. Hello Michael, how are you?

    Hope you can help me. In the Lucene API have the ability to change a document, removes and adds the document. I have the need to add/remove a term by docId/field. There is the possibility to perform the link between a term with a field and its existing document? (field -> terms -> term -> DocIds)

    writer.removeTerm (docId, field, term);
    writer.addTerm (docId, field, term);

    thank you,
    Marcio

    ReplyDelete
    Replies
    1. Hi Marcio,

      Can you send this question to java-user@lucene.apache.org instead?

      Thanks.

      Delete
  5. Hello Mr. McCandless,

    Java Tool link now points to a 404 not found page. Can you give me a valid link?

    Your posts about FST really help me a lot. Thank you.

    ReplyDelete
    Replies
    1. Hi Vicky,

      It seems to be working now; try again?

      Delete
  6. Hi,
    very interesting article. While the web app is still working, the source code is no longer accessible (404-error). Any chance to fix that?
    Thanks, Daniel

    ReplyDelete
    Replies
    1. Hi Anonymous,

      Woops, thanks for notifying me ... I just fixed those links so they should now work again!

      Delete
  7. nice article...btw, could you please guide me how to run the web app and java tool (both source codes are attached by you) in my Windows 7 system?...pls mail me at manabs21@gmail.com

    ReplyDelete
  8. Hi manao',

    I don't use Windows so I really can't help much here, but the code (Java, Python) should be portable... or you can just use the instance that I keep running at http://examples.mikemccandless.com/fst.py

    ReplyDelete
    Replies
    1. then please guide me how to run that on Ubuntu(I think you use that platform) ..I would be grateful to you...btw, http://examples.mikemccandless.com/fst.py link is not working.

      Delete
    2. Have a look at fstApp.py: this is a standard Python WSGI app. You can run it directly with python for testing, or you can run it e.g. with mod_wsgi using Apache (this is how I run it on the public site).

      Delete