tag:blogger.com,1999:blog-8623074010562846957.post6016018046456427862..comments2023-09-01T03:38:08.236-04:00Comments on Changing Bits: Using Finite State Transducers in LuceneMichael McCandlesshttp://www.blogger.com/profile/04277432937861334672noreply@blogger.comBlogger23125tag:blogger.com,1999:blog-8623074010562846957.post-48041267748685581282018-09-03T21:14:54.089-04:002018-09-03T21:14:54.089-04:00Hi Michael,
Is it possible that the FST takes r...Hi Michael,<br /> Is it possible that the FST takes regular expressions so that any sequence matching one regular expression can be transformed to an output. Could you please give me a few pointers please. Thank you!Anonymoushttps://www.blogger.com/profile/02800312246012775631noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-74706918787392531342018-07-07T05:36:19.019-04:002018-07-07T05:36:19.019-04:00After giving it more thoughts, 3)is not correct.
...After giving it more thoughts, 3)is not correct.<br /><br />Perhaps I'll just ask, what would your FST would look like if we indexed popstar?Pussystrokerhttps://www.blogger.com/profile/16483366178504538581noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-9734777832207253322018-07-07T05:20:00.641-04:002018-07-07T05:20:00.641-04:00Hi Michael,
I have three newbie questions about th...Hi Michael,<br />I have three newbie questions about the (your) FST:<br />1) "As you traverse the arcs, you sum up the outputs, so stop hits 3 on the s and 1 on the o, so its output ordinal is 4."<br /><br />For a given word (in your discussion, the word "stop"), are the output values assigned to arbitrary transition ("arcs")? Could you have instead assign: "stop hits 3 on the t and 1 on the o, so its output ordinal is 4."?<br /><br />2) From the start state, why do transitions "p/2" and "t/5" going to the same state?<br /><br />3) For simplicity, let's ignore stemming for a sec, and suppose we index a new word "popstar", do we simply introduce an arc going from end state back to the state which "s/3" points to?Pussystrokerhttps://www.blogger.com/profile/16483366178504538581noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-83709949799681044452018-07-07T05:18:09.359-04:002018-07-07T05:18:09.359-04:00This comment has been removed by the author.Pussystrokerhttps://www.blogger.com/profile/16483366178504538581noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-89658968261888143062017-03-20T15:06:47.130-04:002017-03-20T15:06:47.130-04:00Hi michael,
I have a question regarding prefix que...Hi michael,<br />I have a question regarding prefix queries. How expensive it is to do prefix queries. <br />We are using solr in our organization. Majority of our queries are becoming prefix queries and i am worried that it will effect the performance. since it is not searching at that point but more of a matching. <br /><br />what do you suggest if majority of the queries are prefix then should these be handled on index time using something like EdgeNgram instead doing all this at query time.<br /><br />Just to give you an idea , our index might have 500 million terms.Anonymoushttps://www.blogger.com/profile/12671228725649619010noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-47457394130694843192015-02-11T05:12:32.584-05:002015-02-11T05:12:32.584-05:00Thats astonishing...Thats astonishing...Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-85959787400063741802014-09-15T04:24:35.587-04:002014-09-15T04:24:35.587-04:00Hi Anonymous,
Associating a String to an arbitrar...Hi Anonymous,<br /><br />Associating a String to an arbitrary (including Integer) output is precisely what FSTs are for, so that use case works easily. Try looking at TestFSTs.java for an example?Michael McCandlesshttps://www.blogger.com/profile/04277432937861334672noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-2937322670761630832014-09-14T09:54:04.773-04:002014-09-14T09:54:04.773-04:00I was wondering whether it is possible to associat...I was wondering whether it is possible to associate ids with recognized strings in the automata. Say for instance I have a lexicon and each of its entry has a unique integer id. I'd like to compile the lexicon as an union string automata and whenever it recognizes an entry in a string, it outputs the unique id. It's like using the automata as a set of primary keys.<br />I could not find such an example of use online.<br />In a previous comment you mention MemoryPostingsFormat begin able to output ids (but document id), would it be possible to use that ? ThanksAnonymousnoreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-2015518395366412912014-04-10T10:11:59.044-04:002014-04-10T10:11:59.044-04:00i see thanks. I am almost sure you have seen the t...i see thanks. I am almost sure you have seen the talk Michael Butch(sorry if i misspelled his name) gave about their implementation of memory postings in lucene and recent improvments that can lead them to be merged back in lucene, wondering is there initial talk about that. Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-15811644015356159552014-04-10T09:55:31.632-04:002014-04-10T09:55:31.632-04:00Yes, you can encode that into an FST, and e.g. Luc...Yes, you can encode that into an FST, and e.g. Lucene's MemoryPostingsFormat does exactly that. But this is not a good fit for fields where each term may occur in many docs, because the output byte[] in that case can get quite large.Michael McCandlesshttps://www.blogger.com/profile/04277432937861334672noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-30597329155243095012014-04-10T08:00:22.634-04:002014-04-10T08:00:22.634-04:00I am curious what is the performance of FST. And i...I am curious what is the performance of FST. And its possible to map the words with id they occur in (a simple typeahead). and this will output the ids of document in which word occur can we somehow emulate the ranking formula (Similarity, or bmk21) and bake that in the FST, of course it would not be that simple but is it possilbe. Lets say we control the preprocess of words i.e know how many times the word appeared in the doc know how times word appeared in all docsids and then <br />bake that in some fst`s like wordFST scoringFST and so on. is this approach will benefit over traditional, and as i red there is no incremental way of updateing a fst i has to be rebuild from ground up.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-52847583718087039822014-03-07T06:27:42.480-05:002014-03-07T06:27:42.480-05:00FSTs are used only in certain places, e.g. the ter...FSTs are used only in certain places, e.g. the terms index (held in RAM, mapping term prefixes to blocks on disk), dictionaries (hunspell), suggesters, synonyms, etc. They are also used in an optional MemoryPostingsFormat, where each term and its postings are held in a big FST (good for primary key fields). We also more recently added a terms dict that holds just the terms + pointers to on-disk postings, in an FST.<br /><br />What the output contains / where it points to varies drastically with each of these uses... really it's generic and can be whatever you need it to be.Michael McCandlesshttps://www.blogger.com/profile/04277432937861334672noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-6161397459467394542014-03-07T04:01:15.103-05:002014-03-07T04:01:15.103-05:00Hey Mike,
Are we using fst to make inverted index?...Hey Mike,<br />Are we using fst to make inverted index? If yes, where the output of traversing a string in fst pointing to?<br />Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-2762714765118773132013-10-11T07:48:20.329-04:002013-10-11T07:48:20.329-04:00Hi Alessandro,
Maybe read the original paper for ...Hi Alessandro,<br /><br />Maybe read the original paper for the FST building algorithm we implemented in Lucene? http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698<br /><br />But note that the Levenshtein automaton that we build/intersect for FuzzyQuery is not an FST; it's an Automaton (Lucene has two separate packages for these). I suspect by not pre-building the automaton, but rather expanding it on the fly, we could see some speedups in certain cases; not sure how much.Michael McCandlesshttps://www.blogger.com/profile/04277432937861334672noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-42724803258008678292013-10-04T07:15:41.484-04:002013-10-04T07:15:41.484-04:00Yes , it makes sense and sounds reasonable :)
But ...Yes , it makes sense and sounds reasonable :)<br />But is it feasible ?<br />I am starting learning about FSA and FST now, so I'm not yet into the implementation,<br />but one thing I remember is that Lucene FST implementation is so good and compact because it creates an immutable FST ( that becomes a sort of byte array under the hood) .<br />So creating an expandable FST is already possible or it will be a challenge ?<br />Can you suggest me other material to study ragarding this really interesting topic ? ( I followed your Solr revolution conference, some blog posts and videos so far :) )Anonymoushttps://www.blogger.com/profile/06206197690807293239noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-52656822465049307982013-10-04T07:06:46.232-04:002013-10-04T07:06:46.232-04:00Yes, that automaton.
But it turns out you don'...Yes, that automaton.<br /><br />But it turns out you don't have to fully build it up front, in order to do the "intersection". You can expand it as-you-go, which in theory saves time because then you only expand the parts that your terms dictionary would ever have encountered. It's sort of like a lazy automaton building ...Michael McCandlesshttps://www.blogger.com/profile/04277432937861334672noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-83874982400155841212013-10-03T12:56:19.346-04:002013-10-03T12:56:19.346-04:00Thank you Michael for the quick answer,
For the in...Thank you Michael for the quick answer,<br />For the intermediate automaton are you referring to the Levenstein Automaton, accepting all the strings with the expected distance from the query term ?<br />So directly building an automaton, accepting only Index terms with the expected distance from the the query term ?Anonymoushttps://www.blogger.com/profile/06206197690807293239noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-57496333445498088392013-10-03T07:28:36.803-04:002013-10-03T07:28:36.803-04:00Hi Alessandro,
Unfortunately, I believe the Fuzzy...Hi Alessandro,<br /><br />Unfortunately, I believe the FuzzyQuery performance was slower with the new FST-based terms dictionary, but yes, that is the idea: it ought to be faster.<br /><br />DirectPostingsFormat, which stores all terms AND postings in RAM, does give a good speedup for fuzzy queries, so a DirectSpellChecker implemented on that would be faster. However, this format is VERY RAM consuming, i.e. it makes no effort to "compress" things (unlike FST).<br /><br />I think there is room for a happy medium, e.g. pull out the terms dict from DirectPostingsFormat, maybe compress it a bit (but not as much as FST), and I think we'd see good speedups for terms-dict intensive operations.<br /><br />We may even want to implement the Levenshtein intersection without the intermediate automaton, i.e. "on the fly", because constructing that automaton takes non-trivial time.Michael McCandlesshttps://www.blogger.com/profile/04277432937861334672noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-27425417587225912182013-10-01T10:58:55.948-04:002013-10-01T10:58:55.948-04:00Hi Michael,
it seems that the result of the Google...Hi Michael,<br />it seems that the result of the Google Summer of Code project has been committed.<br />Having a FST holding the complete Term Dictionary I think could be useful, for example in the SpellCheck scenario, to make a quick intersection between the Levenstein Automaton and the Index Terms Dictionary .<br />Am I right ?<br />Also in the Term component scenario, i think it can give us improvements ...<br />Can you briefly summarize me where do you think to use the FST Term Dictionary ? :)<br /><br />CheersAnonymoushttps://www.blogger.com/profile/06206197690807293239noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-69707199701559733622013-08-02T09:48:15.633-04:002013-08-02T09:48:15.633-04:00Hi kbros,
The FST seek time is typically a tiny p...Hi kbros,<br /><br />The FST seek time is typically a tiny part of the overall query execution time; usually most of the time is spent iterating through the postings once the lookup is done in the terms dict. Only terms-dict heavy queries (e.g. FuzzyQuery, DirectSpellChecker, WildcardQuery) will really stress the terms dictionary.<br /><br />That said, we have a Google Summer of Code project now working on using an FST to hold all terms in the terms dict, which then relies much more on the FST performance ... https://issues.apache.org/jira/browse/LUCENE-3069 has the details.<br /><br />But if you are worried about it, run a profiler and report back the results! There are always things that need fixing :)Michael McCandlesshttps://www.blogger.com/profile/04277432937861334672noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-56196551353922698782013-08-01T20:08:56.148-04:002013-08-01T20:08:56.148-04:00Hi Mike,
As the FST has a high CPU because of seek...Hi Mike,<br />As the FST has a high CPU because of seeking, how bad are its performances affected by multiplying the number of existing terms?<br /><br />Do you have an estimation of how is a typical query time distributed between the FST, seek in the term dictionary, calculating frequency or positions? How can I profile that in my own index?<br /><br />Thanks!kbroshttps://www.blogger.com/profile/15698030213270606806noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-35359225951577160252013-05-25T10:51:17.811-04:002013-05-25T10:51:17.811-04:00Hi Michael,
That FST held all unique terms (token...Hi Michael,<br /><br />That FST held all unique terms (tokens) from indexing all English Wikipedia content.Michael McCandlesshttps://www.blogger.com/profile/04277432937861334672noreply@blogger.comtag:blogger.com,1999:blog-8623074010562846957.post-52670124888631959502013-05-24T22:53:43.197-04:002013-05-24T22:53:43.197-04:00Hello, what does the fst with 69 mb for wikipedia ...Hello, what does the fst with 69 mb for wikipedia contains ? Only the titles of the entries or the whole text as well ?Michael Kleenhttps://www.blogger.com/profile/00619748254697177073noreply@blogger.com