Have you always wanted your very own Lucene finite state transducer (FST) but you couldn't figure out how to use Lucene's crazy APIs?
Then today is your lucky day! I just built a simple web
application that creates an FST from the input/output strings that
you enter.
If you just want a finite state automaton (no outputs) then enter only
inputs, as in this example:
If all of your outputs are
integers then the FST will use numeric outputs, where you sum up
the outputs as you traverse a path to get the final output:
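To make the summing concrete, here is a minimal Python sketch of a hand-built transducer. The inputs ("cat" and "cow"), their outputs (5 and 7) and the names `ARCS`, `FINAL` and `lookup` are all made up for illustration; this is not Lucene's API, and the FST that Lucene builds may place the outputs on different arcs.

```python
# Hypothetical transducer for "cat" -> 5 and "cow" -> 7.
# Each state maps an input label to (arc output, next state).
ARCS = {
    0: {"c": (5, 1)},               # shared prefix carries the common output
    1: {"a": (0, 2), "o": (2, 3)},  # "o" adds the extra 2 needed for "cow"
    2: {"t": (0, 4)},
    3: {"w": (0, 4)},
    4: {},
}
FINAL = {4}  # accepting states


def lookup(word):
    """Follow the arcs for word, summing outputs; None if not accepted."""
    state, total = 0, 0
    for ch in word:
        arc = ARCS[state].get(ch)
        if arc is None:
            return None
        output, state = arc
        total += output
    return total if state in FINAL else None
```

Here `lookup("cat")` walks c (5), a (0), t (0) and returns 5, while `lookup("cow")` picks up the extra 2 on the "o" arc and returns 7. A prefix such as "ca" ends on a non-final state, so it returns `None`.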
Finally, if the outputs are non-numeric then they are treated as
strings, in which case you concatenate as you traverse the path:
The red arcs are the ones with the NEXT optimization: these arcs do
not store a pointer to a node because their to-node is the very next
node in the FST. This is a good optimization: it generally results in
a large reduction of the FST size. The bolded arcs tell you the
next node is final; this is most interesting when a prefix of another
input is accepted, such as this example:
Here the "r" arc is bolded, telling you that "star" is accepted.
Furthermore, that node following the "r" arc has a final output,
telling you the overall output for "star" is "abc".
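A small Python sketch may help show how string outputs and a final output combine. It assumes a second, hypothetical input "stars" with output "abd" (the entries are invented for illustration), and the names `ARCS`, `FINAL_OUTPUT` and `lookup` are made up too; Lucene may distribute the outputs differently.

```python
# Hypothetical transducer for "star" -> "abc" and "stars" -> "abd".
# Each state maps an input label to (arc output, next state).
ARCS = {
    0: {"s": ("ab", 1)},  # common output prefix sits on the first arc
    1: {"t": ("", 2)},
    2: {"a": ("", 3)},
    3: {"r": ("", 4)},
    4: {"s": ("d", 5)},
    5: {},
}
# Final states and the output appended when the input ends there.
FINAL_OUTPUT = {4: "c", 5: ""}


def lookup(word):
    """Concatenate arc outputs along the path, plus any final output."""
    state, result = 0, ""
    for ch in word:
        arc = ARCS[state].get(ch)
        if arc is None:
            return None
        output, state = arc
        result += output
    if state not in FINAL_OUTPUT:
        return None
    return result + FINAL_OUTPUT[state]
```

Stopping after the "r" arc lands on a final state whose final output "c" completes "ab" to "abc"; continuing along the "s" arc appends "d" instead, giving "abd".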
The web app is a simple Python WSGI app; source code
It invokes a simple Java tool as a subprocess; source code (including
Hello Mr McCandless,
Nice blog! Is there an email address I can contact you in private?
Head of Editorial Team
Java Code Geeks
Is it possible to optimize the chains "cde" and "xcde" of the entries "abxcdey abicdej" into one "cde"?
No, FSTs cannot do that, because those two "cde" carry different state since they complete to different suffixes.
If they both completed to the same set of suffixes then "cde" is shared, e.g.:
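One way to see the difference is to count states after merging every pair of nodes that accept exactly the same set of suffixes. The Python sketch below does that for a plain automaton (no outputs); it is not how Lucene's builder works, and the second word list (with "abicdey") is a made-up case where both chains complete identically.

```python
def build_trie(words):
    """Build a plain trie; each node is {"final": bool, "arcs": {label: node}}."""
    root = {"final": False, "arcs": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["arcs"].setdefault(ch, {"final": False, "arcs": {}})
        node["final"] = True
    return root


def count_minimal_states(words):
    """Count the states left after merging nodes that accept the same suffixes."""
    registry = {}  # node signature -> merged state id

    def canonical_id(node):
        # Two nodes are equivalent when they agree on finality and on where
        # every outgoing label leads (after the children have been merged).
        children = sorted(
            (ch, canonical_id(child)) for ch, child in node["arcs"].items()
        )
        signature = (node["final"], tuple(children))
        return registry.setdefault(signature, len(registry))

    canonical_id(build_trie(words))
    return len(registry)


different_suffixes = count_minimal_states(["abxcdey", "abicdej"])
same_suffixes = count_minimal_states(["abxcdey", "abicdey"])
```

With "abxcdey" and "abicdej" the two "cde" chains stay separate, because one must end in "y" and the other in "j", so `different_suffixes` is larger than `same_suffixes`, where the chains collapse into one.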
Excellent article. Any thoughts on how this can be used for key/value store kinds of applications?
FSTs can easily store key/value pairs, since they really act like a SortedMap. However, this is only "useful" if 1) you can fit the entire map into RAM, and 2) the keys have commonalities to them (e.g. words from natural language), in which case they compress well.
Hello Michael, how are you?
Hope you can help me. The Lucene API lets you change a document by removing it and adding it again. I need to add or remove a single term by docId and field. Is it possible to link a term to a field of an existing document? (field -> terms -> term -> DocIds)
writer.removeTerm (docId, field, term);
writer.addTerm (docId, field, term);
Can you send this question to email@example.com instead?
Hello Mr. McCandless,
Java Tool link now points to a 404 not found page. Can you give me a valid link?
Your posts about FST really help me a lot. Thank you.
It seems to be working now; try again?
very interesting article. While the web app is still working, the source code is no longer accessible (404-error). Any chance to fix that?
Woops, thanks for notifying me ... I just fixed those links so they should now work again!
nice article...btw, could you please guide me how to run the web app and java tool (both source codes are attached by you) in my Windows 7 system?...pls mail me at firstname.lastname@example.org
I don't use Windows so I really can't help much here, but the code (Java, Python) should be portable... or you can just use the instance that I keep running at http://examples.mikemccandless.com/fst.py
then please guide me how to run that on Ubuntu (I think you use that platform) ..I would be grateful to you...btw, http://examples.mikemccandless.com/fst.py link is not working.
Have a look at fstApp.py: this is a standard Python WSGI app. You can run it directly with python for testing, or you can run it e.g. with mod_wsgi using Apache (this is how I run it on the public site).
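For local testing, a WSGI app can be served with the `wsgiref` module from Python's standard library. The sketch below is written for Python 3 and uses a placeholder `application` callable, since the contents of fstApp.py are not shown here; the `serve` helper and the port number are assumptions too. To try the real app, you would pass its WSGI callable to `make_server` instead.

```python
from wsgiref.simple_server import make_server


def application(environ, start_response):
    """Placeholder WSGI callable standing in for the real app."""
    body = b"FST demo placeholder\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def serve(port=8000):
    """Serve the app on localhost until interrupted (testing only)."""
    with make_server("localhost", port, application) as httpd:
        httpd.serve_forever()
```

Calling `serve()` and then opening http://localhost:8000/ in a browser should show the placeholder text. This built-in server is only meant for testing; for a public site something like mod_wsgi under Apache is the better fit.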