For some reason having to do with the medieval French, the scoring system for tennis is very strange.
In actuality, the game is easy to explain: to win, you must score at
least 4 points and win by at least 2. Yet in practice, you are
supposed to use strange labels like "love" (0 points), "15" (1 point),
"30" (2 points), "40" (3 points), "deuce" (3 or more points each, and
the players are tied), "all" (players are tied) instead of simply
tracking points as numbers, as other sports do.
This is of course wildly confusing to newcomers. Fortunately, the convoluted logic is easy to express as a finite-state automaton (FSA):
The game begins in the left-most (unlabeled) state, and then each time
either player 1 (red) or player 2
(blue) scores, you advance to the
corresponding state to know how to say the score properly in
tennis-speak. In each state, player 1's score is given first, followed by player 2's; for example, "40 30" means player 1 has scored 3 points and player 2 has scored 2, and "15 all" means both players have scored once. "adv 2" means player 2 is ahead by 1 point and will win if s/he scores again.
There are only 20 states, and there are cycles, which means a tennis game can in fact go on indefinitely if the players pass back and forth through the "deuce" (translation: the game is tied) state.
This FSA is correct, and if you watch a Wimbledon match, for example, you'll see the game advance through precisely these states.
Minimization
Yet for an FSA, merely being correct is not good enough! It should also strive to be minimal, and surprisingly this FSA is not: if you build this Automaton in Lucene and minimize it, you'll discover that there are some wasted states!
This means 20 states is overkill when deciding who won the game.
Specifically, there is no difference between the "30 all" and "deuce"
states, nor between the "30 40" and "adv 2" states, nor between the
"40 30" and "adv 1" states. From either state in each of these pairs,
there is no sequence of player 1 / player 2 scoring that will result
in a different final outcome (this is in principle how the minimization
process identifies indistinguishable states).
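In code, that build-and-minimize step looks roughly like this. It's a minimal sketch using Lucene's org.apache.lucene.util.automaton classes, with a toy two-state redundancy instead of the full 20-state tennis machine; the exact method signatures vary a bit across Lucene versions:

    import org.apache.lucene.util.automaton.Automaton;
    import org.apache.lucene.util.automaton.MinimizationOperations;

    public class MinimizeDemo {
      public static void main(String[] args) {
        // Toy automaton with two redundant states, analogous to "30 all" vs
        // "deuce": from either one, the same future leads to the same accept state.
        Automaton fsa = new Automaton();
        int start = fsa.createState();
        int a = fsa.createState();
        int b = fsa.createState();
        int done = fsa.createState();
        fsa.setAccept(done, true);
        fsa.addTransition(start, a, 1);   // label 1: player 1 scores
        fsa.addTransition(start, b, 2);   // label 2: player 2 scores
        fsa.addTransition(a, done, 1);    // identical futures from a and b...
        fsa.addTransition(b, done, 1);    // ...so minimization merges them
        fsa.finishState();

        // Some Lucene versions take only the automaton; recent ones also take
        // a limit on the number of determinized states.
        Automaton min = MinimizationOperations.minimize(fsa, 10000);
        System.out.println(fsa.getNumStates() + " states -> " + min.getNumStates());  // e.g. 4 -> 3
      }
    }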
Therefore, there's no point in keeping those states, and you can
safely use this smaller 17-state FSA (15% smaller!) to score your
tennis games instead:
For example, from "15 30", if player 1 scores, you go straight to
"deuce" and don't bother with the redundant "30 30" state.
Another (simpler?) way to understand why these states are wasted is to
recognize that the finite state machine is tracking two different pieces
of information: first, how many points ahead player 1 is (since a
player must win by 2 points) and second, how many points have been
scored (since a player must score at least 4 points to win).
Once enough points (4 or more) have been scored by either player,
their absolute scores no longer matter. All that matters is the
relative score: whether player 1 is ahead by 1, equal, or behind by 1.
For example, we don't care if the score is 197 to 196 or 6 to 5: they
are the same thing.
Yet, early on, the FSA must also track the absolute scores, to ensure
at least 4 points were scored by the winner. With the original
20-state FSA, the crossover between these two phases was what would
have been "40 40" (each player scored 3 points). But in the minimal
machine, the crossover became "30 30" (each player scored 2 points),
which is safe since each player must still "win by 2": if player 1 scores 2 points from "30 30", that means player 1 has scored 4 points overall.
FSA minimization saved only 3 states for the game of tennis, resulting in a 15% smaller automaton, and maybe this simplifies keeping track of scores in your games a bit. But in other FSA applications in Lucene, such as the analyzing suggester, MemoryPostingsFormat and the terms index, minimization is vital, since it saves substantial disk space and RAM!
Sunday, August 3, 2014
A new proximity query for Lucene, using automatons
The simplest Apache Lucene query,
TermQuery, matches any document that contains the specified term, regardless of where the term occurs inside each document. Using BooleanQuery you can combine multiple TermQuerys, with full control over which terms are optional (SHOULD) and which are required (MUST) or required not to be present (MUST_NOT), but still the matching ignores the relative positions of each term inside the document.
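For example, with the 4.x-era API (later releases use BooleanQuery.Builder), a query over a hypothetical "body" field that requires obama, treats barack as optional and excludes bush could look like this:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // Positions are ignored: any document containing the required terms
    // anywhere, in any order, matches.
    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("body", "obama")), Occur.MUST);      // required
    bq.add(new TermQuery(new Term("body", "barack")), Occur.SHOULD);   // optional
    bq.add(new TermQuery(new Term("body", "bush")), Occur.MUST_NOT);   // must not be present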
Sometimes you do care about the positions of the terms, and for such cases Lucene has various so-called proximity queries.
The simplest proximity query is PhraseQuery, to match a specific sequence of tokens such as "Barack Obama". Seen as a graph, a PhraseQuery is a simple linear chain:
By default the phrase must precisely match, but if you set a non-zero slop factor, a document can still match even when the tokens are not exactly in sequence, as long as the edit distance is within the specified slop. For example, "Barack Obama" with a slop factor of 1 will also match a document containing "Barack Hussein Obama" or "Barack H. Obama". It looks like this graph:
Now there are multiple paths through the graph, including an any (*) transition to match an arbitrary token. (Note: while the graph cannot properly express it, this query would also match a document that had the tokens Barack and Obama on top of one another, at the same position, which is a little bit strange!)
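In code, the sloppy phrase above can be built like this (4.x-style API; later versions use PhraseQuery.Builder; the "body" field name is just for illustration):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;

    // "Barack Obama" with slop=1: also matches "Barack Hussein Obama", etc.
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("body", "barack"));
    pq.add(new Term("body", "obama"));
    pq.setSlop(1);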
In general, proximity queries are more costly on both CPU and IO resources, since they must load, decode and visit another dimension (positions) for each potential document hit. That said, for exact (no slop) matches, using common-grams, shingles and ngrams to index additional "proximity terms" in the index can provide enormous performance improvements in some cases, at the expense of an increase in index size.
MultiPhraseQuery is another proximity query. It generalizes PhraseQuery by allowing more than one token at each position, for example:
This matches any document containing either domain name system or domain name service. MultiPhraseQuery also accepts a slop factor to allow for non-precise matches.
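Constructing that example might look like this (again the 4.x-style API, with two single-token positions followed by a two-token position):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.MultiPhraseQuery;

    // Matches "domain name system" or "domain name service".
    MultiPhraseQuery mpq = new MultiPhraseQuery();
    mpq.add(new Term("body", "domain"));
    mpq.add(new Term("body", "name"));
    mpq.add(new Term[] { new Term("body", "system"), new Term("body", "service") });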
Finally, span queries (e.g. SpanNearQuery, SpanFirstQuery) go even further, allowing you to build up a complex compound query based on positions where each clause matched. What makes them unique is that you can arbitrarily nest them. For example, you could first build a SpanNearQuery matching Barack Obama with slop=1, then another one matching George Bush, and then make another SpanNearQuery, containing both of those as sub-clauses, matching if they appear within 10 terms of one another.
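A sketch of that nesting (the field name and slop values are just for illustration):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // "barack obama" within 1 position, in order:
    SpanQuery obama = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "barack")),
        new SpanTermQuery(new Term("body", "obama"))}, 1, true);

    // "george bush" as an exact in-order phrase:
    SpanQuery bush = new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "george")),
        new SpanTermQuery(new Term("body", "bush"))}, 0, true);

    // Both phrases within 10 terms of one another, in either order:
    SpanQuery both = new SpanNearQuery(new SpanQuery[] {obama, bush}, 10, false);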
Introducing TermAutomatonQuery
As of Lucene 4.10 there will be a new proximity query to further generalize on MultiPhraseQuery and the span queries: it allows you to directly build an arbitrary automaton expressing how the terms must occur in sequence, including any transitions to handle slop. Here's an example:
This is a very expert query, allowing you fine control over exactly what sequence of tokens constitutes a match. You build the automaton state-by-state and transition-by-transition, including explicitly adding any transitions (sorry, no QueryParser support yet, patches welcome!). Once that's done, the query determinizes the automaton and then uses the same infrastructure (e.g. CompiledAutomaton) that queries like FuzzyQuery use for fast term matching, but applied to term positions instead of term bytes. The query is naively scored like a phrase query, which may not be ideal in some cases.
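As a rough sketch, an automaton accepting "barack obama" as well as "barack <any token> obama" could be assembled like this (TermAutomatonQuery lives in Lucene's sandbox module; the method names below reflect my reading of the 4.10 API):

    import org.apache.lucene.search.TermAutomatonQuery;

    TermAutomatonQuery q = new TermAutomatonQuery("body");
    int init = q.createState();
    int afterBarack = q.createState();
    int afterAny = q.createState();
    int accept = q.createState();
    q.setAccept(accept, true);
    q.addTransition(init, afterBarack, "barack");
    q.addTransition(afterBarack, accept, "obama");   // "barack obama"
    q.addAnyTransition(afterBarack, afterAny);       // one arbitrary token in between...
    q.addTransition(afterAny, accept, "obama");      // ..."barack * obama"
    q.finish();                                      // determinizes the automaton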
In addition to this new query there is also a simple utility class, TokenStreamToTermAutomatonQuery, that provides loss-less translation of any graph TokenStream into the equivalent TermAutomatonQuery. This is powerful because it means even arbitrary token stream graphs will be correctly represented at search time, preserving the PositionLengthAttribute that some tokenizers now set.
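For example, assuming some query-time Analyzer (here called analyzer) whose chain produces a token graph, the translation is roughly a one-liner (a sketch; toQuery is the method as I understand the API):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.search.TermAutomatonQuery;
    import org.apache.lucene.search.TokenStreamToTermAutomatonQuery;

    // "analyzer" is whatever query-time Analyzer you use, e.g. one containing
    // a graph-producing SynonymFilter (its construction is not shown here).
    try (TokenStream ts = analyzer.tokenStream("body", "barack obama")) {
      TermAutomatonQuery q = new TokenStreamToTermAutomatonQuery().toQuery("body", ts);
      // q now matches exactly the token graph the analyzer produced
    }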
While this means you can finally correctly apply arbitrary token stream graph synonyms at query-time, because the index still does not store PositionLengthAttribute, index-time synonyms are still not fully correct. That said, it would be simple to build a TokenFilter that writes the position length into a payload, and then to extend the new TermAutomatonQuery to read from the payload and apply that length during matching (patches welcome!).
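Here's a sketch of what that index-time TokenFilter could look like; this is my own illustration of the idea, not something that ships with Lucene:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
    import org.apache.lucene.util.BytesRef;

    // Copies each token's position length into its payload so that it
    // survives in the index (position length itself is not stored).
    final class PositionLengthPayloadFilter extends TokenFilter {
      private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
      private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

      PositionLengthPayloadFilter(TokenStream in) {
        super(in);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        // A single byte is enough for small graphs; a real implementation
        // might prefer a vInt encoding.
        payloadAtt.setPayload(new BytesRef(new byte[] {(byte) posLenAtt.getPositionLength()}));
        return true;
      }
    }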
The query is likely quite slow, because it assumes every term is optional; in many cases it would be easy to determine required terms (e.g. Obama in the above example) and optimize such cases. In the case where the query was derived from a token stream, so that it has no cycles and does not use any transitions, it may be faster to enumerate all phrases accepted by the automaton (Lucene already has the getFiniteStrings API to do this for any automaton) and construct a boolean query from those phrase queries. This would match the same set of documents, also correctly preserving PositionLengthAttribute, but would assign different scores.
The code is very new and there are surely some exciting bugs! But it should be a nice start for any application that needs precise control over where terms occur inside documents.