Saturday, August 30, 2014

Scoring tennis using finite-state automata

For some reason having to do with the medieval French, the scoring system for tennis is very strange.

In actuality, the game is easy to explain: to win, you must score at least 4 points and win by at least 2. Yet in practice, you are supposed to use strange labels like "love" (0 points), "15" (1 point), "30" (2 points), "40" (3 points), "deuce" (3 or more points each, and the players are tied), "all" (players are tied) instead of simply tracking points as numbers, as other sports do.

This is of course wildly confusing to newcomers. Fortunately, the convoluted logic is easy to express as a finite-state automaton (FSA):

The game begins in the left-most (unlabeled) state, and then each time either player 1 (red) or player 2 (blue) scores, you advance to the corresponding state to know how to say the score properly in tennis-speak. In each state, player 1's score is first followed by player 2's; for example "40 30" means player 1 has scored 3 points and player 2 has scored 2 and "15 all" means both players have scored once. "adv 2" means player 2 is ahead by 1 point and will win if s/he scores again.

There are only 20 states, and there are cycles which means a tennis game can in fact go on indefinitely, if the players pass back and forth through the "deuce" (translation: game is tied) state.

This FSA is correct, and if you watch a Wimbledon match, for example, you'll see the game advance through precisely these states.


Yet for an FSA, merely being correct is not good enough!

It should also strive to be minimal, and surprisingly this FSA is not: if you build this Automaton in Lucene and minimize it, you'll discover that there are some wasted states! This means 20 states is overkill when deciding who won the game.

Specifically, there is no difference between the "30 all" and "deuce" states, nor between the "30 40" and "adv 2" states, nor between the "40 30" and "adv 1" states. From either state in each of these pairs, there is no sequence of player 1 / player 2 scoring that will result in a different final outcome (this is in principle how the minimization process identifies indistinguishable states).

Therefore, there's no point in keeping those states, and you can safely use this smaller 17-state FSA (15% smaller!) to score your tennis games instead:

For example, from "15 30", if player 1 scores, you go straight to "deuce" and don't bother with the redundant "30 30" state.

Another (simpler?) way to understand why these states are wasted is to recognize that the finite state machine is tracking two different pieces of information: first, how many points ahead player 1 is (since a player must win by 2 points) and second, how many points have been scored (since a player must score at least 4 points to win).

Once enough points (4 or more) have been scored by either player, their absolute scores no longer matter. All that matters is the relative score: whether player 1 is ahead by 1, equal, or behind by 1. For example, we don't care if the score is 197 to 196 or 6 to 5: they are the same thing.

Yet, early on, the FSA must also track the absolute scores, to ensure at least 4 points were scored by the winner. With the original 20-state FSA, the crossover between these two phases was what would have been "40 40" (each player scored 3 points). But in the minimal machine, the crossover became "30 30" (each player scored 2 points), which is safe since each player must still "win by 2" so if player 1 scores 2 points from "30 30", that means player 1 scored 4 points overall.

FSA minimization saved only 3 states for the game of tennis, resulting in a 15% smaller automaton, and maybe this simplifies keeping track of scores in your games by a bit, but in other FSA applications in Lucene, such as the analyzing suggester, MemoryPostingsFormat and the terms index, minimization is vital since it saves substantial disk and RAM for Lucene applications!

Sunday, August 3, 2014

A new proximity query for Lucene, using automatons

The simplest Apache Lucene query, TermQuery, matches any document that contains the specified term, regardless of where the term occurs inside each document. Using BooleanQuery you can combine multiple TermQuerys, with full control over which terms are optional (SHOULD) and which are required (MUST) or required not to be present (MUST_NOT), but still the matching ignores the relative positions of each term inside the document.

Sometimes you do care about the positions of the terms, and for such cases Lucene has various so-called proximity queries.

The simplest proximity query is PhraseQuery, to match a specific sequence of tokens such as "Barack Obama". Seen as a graph, a PhraseQuery is a simple linear chain:

By default the phrase must precisely match, but if you set a non-zero slop factor, a document can still match even when the tokens are not exactly in sequence, as long as the edit distance is within the specified slop. For example, "Barack Obama" with a slop factor of 1 will also match a document containing "Barack Hussein Obama" or "Barack H. Obama". It looks like this graph:

Now there are multiple paths through the graph, including an any (*) transition to match an arbitrary token. (Note: while the graph cannot properly express it, this query would also match a document that had the tokens Barack and Obama on top of one another, at the same position, which is a little bit strange!)

In general, proximity queries are more costly on both CPU and IO resources, since they must load, decode and visit another dimension (positions) for each potential document hit. That said, for exact (no slop) matches, using common-grams, shingles and ngrams to index additional "proximity terms" in the index can provide enormous performance improvements in some cases, at the expense of an increase in index size.

MultiPhraseQuery is another proximity query. It generalizes PhraseQuery by allowing more than one token at each position, for example:

This matches any document containing either domain name system or domain name service. MultiPhraseQuery also accepts a slop factor to allow for non-precise matches.

Finally, span queries (e.g. SpanNearQuery, SpanFirstQuery) go even further, allowing you to build up a complex compound query based on positions where each clause matched. What makes them unique is that you can arbitrarily nest them. For example, you could first build a SpanNearQuery matching Barack Obama with slop=1, then another one matching George Bush, and then make another SpanNearQuery, containing both of those as sub-clauses, matching if they appear within 10 terms of one another.

Introducing TermAutomatonQuery

As of Lucene 4.10 there will be a new proximity query to further generalize on MultiPhraseQuery and the span queries: it allows you to directly build an arbitrary automaton expressing how the terms must occur in sequence, including any transitions to handle slop. Here's an example:

This is a very expert query, allowing you fine control over exactly what sequence of tokens constitutes a match. You build the automaton state-by-state and transition-by-transition, including explicitly adding any transitions (sorry, no QueryParser support yet, patches welcome!). Once that's done, the query determinizes the automaton and then uses the same infrastructure (e.g. CompiledAutomaton) that queries like FuzzyQuery use for fast term matching, but applied to term positions instead of term bytes. The query is naively scored like a phrase query, which may not be ideal in some cases.

In addition to this new query there is also a simple utility class, TokenStreamToTermAutomatonQuery, that provides loss-less translation of any graph TokenStream into the equivalent TermAutomatonQuery. This is powerful because it means even arbitrary token stream graphs will be correctly represented at search time, preserving the PositionLengthAttribute that some tokenizers now set.

While this means you can finally correctly apply arbitrary token stream graph synonyms at query-time, because the index still does not store PositionLengthAttribute, index-time synonyms are still not fully correct. That said, it would be simple to build a TokenFilter that writes the position length into a payload, and then to extend the new TermAutomatonQuery to read from the payload and apply that length during matching (patches welcome!).

The query is likely quite slow, because it assumes every term is optional; in many cases it would be easy to determine required terms (e.g. Obama in the above example) and optimize such cases. In the case where the query was derived from a token stream, so that it has no cycles and does not use any transitions, it may be faster to enumerate all phrases accepted by the automaton (Lucene already has the getFiniteStrings API to do this for any automaton) and construct a boolean query from those phrase queries. This would match the same set of documents, also correctly preserving PositionLengthAttribute, but would assign different scores.

The code is very new and there are surely some exciting bugs! But it should be a nice start for any application that needs precise control over where terms occur inside documents.