Sunday, August 3, 2014

A new proximity query for Lucene, using automatons

The simplest Apache Lucene query, TermQuery, matches any document that contains the specified term, regardless of where the term occurs inside each document. Using BooleanQuery you can combine multiple TermQuerys, with full control over which terms are optional (SHOULD) and which are required (MUST) or required not to be present (MUST_NOT), but still the matching ignores the relative positions of each term inside the document.
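As a quick sketch, here is how such a combination looks with the pre-5.0 BooleanQuery API (the field name "body" is just an assumption for illustration):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Match documents containing both "barack" and "obama",
// preferring (but not requiring) documents that also contain "president":
BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("body", "barack")), BooleanClause.Occur.MUST);
bq.add(new TermQuery(new Term("body", "obama")), BooleanClause.Occur.MUST);
bq.add(new TermQuery(new Term("body", "president")), BooleanClause.Occur.SHOULD);
```

No matter how the clauses are combined, the positions of the matched terms inside each document are never consulted.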

Sometimes you do care about the positions of the terms, and for such cases Lucene has various so-called proximity queries.

The simplest proximity query is PhraseQuery, to match a specific sequence of tokens such as "Barack Obama". Seen as a graph, a PhraseQuery is a simple linear chain:

By default the phrase must precisely match, but if you set a non-zero slop factor, a document can still match even when the tokens are not exactly in sequence, as long as the edit distance is within the specified slop. For example, "Barack Obama" with a slop factor of 1 will also match a document containing "Barack Hussein Obama" or "Barack H. Obama". It looks like this graph:
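In code, a sloppy phrase query is just an exact one plus a slop setting (again assuming a "body" field):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// "barack obama" with an edit distance (slop) of up to 1,
// so "barack hussein obama" also matches:
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("body", "barack"));
pq.add(new Term("body", "obama"));
pq.setSlop(1);
```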

Now there are multiple paths through the graph, including an any (*) transition to match an arbitrary token. (Note: while the graph cannot properly express it, this query would also match a document that had the tokens Barack and Obama on top of one another, at the same position, which is a little bit strange!)

In general, proximity queries are more costly on both CPU and IO resources, since they must load, decode and visit another dimension (positions) for each potential document hit. That said, for exact (no slop) matches, using common-grams, shingles and ngrams to index additional "proximity terms" in the index can provide enormous performance improvements in some cases, at the expense of an increase in index size.

MultiPhraseQuery is another proximity query. It generalizes PhraseQuery by allowing more than one token at each position, for example:

This matches any document containing either domain name system or domain name service. MultiPhraseQuery also accepts a slop factor to allow for non-precise matches.
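A sketch of building that query: positions with a single alternative take a single Term, and positions with several alternatives take a Term array.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiPhraseQuery;

// Matches "domain name system" or "domain name service":
MultiPhraseQuery mpq = new MultiPhraseQuery();
mpq.add(new Term("body", "domain"));
mpq.add(new Term("body", "name"));
mpq.add(new Term[] {
    new Term("body", "system"),
    new Term("body", "service")});
```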

Finally, span queries (e.g. SpanNearQuery, SpanFirstQuery) go even further, allowing you to build up a complex compound query based on positions where each clause matched. What makes them unique is that you can arbitrarily nest them. For example, you could first build a SpanNearQuery matching Barack Obama with slop=1, then another one matching George Bush, and then make another SpanNearQuery, containing both of those as sub-clauses, matching if they appear within 10 terms of one another.
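That nested example can be sketched like so (field name and slop values as described above; the final `false` asks for matches in either order):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// "barack obama" with slop=1, in order:
SpanQuery obama = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("body", "barack")),
    new SpanTermQuery(new Term("body", "obama"))}, 1, true);

// "george bush", exact, in order:
SpanQuery bush = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("body", "george")),
    new SpanTermQuery(new Term("body", "bush"))}, 0, true);

// Both sub-queries within 10 positions of one another, in either order:
SpanQuery both = new SpanNearQuery(new SpanQuery[] {obama, bush}, 10, false);
```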

Introducing TermAutomatonQuery

As of Lucene 4.10 there will be a new proximity query to further generalize on MultiPhraseQuery and the span queries: it allows you to directly build an arbitrary automaton expressing how the terms must occur in sequence, including any transitions to handle slop. Here's an example:

This is a very expert query, giving you fine control over exactly which sequences of tokens constitute a match. You build the automaton state-by-state and transition-by-transition, including explicitly adding any transitions (sorry, no QueryParser support yet, patches welcome!). Once that's done, the query determinizes the automaton and then uses the same infrastructure (e.g. CompiledAutomaton) that queries like FuzzyQuery use for fast term matching, but applied to term positions instead of term bytes. The query is naively scored like a phrase query, which may not be ideal in some cases.
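A sketch of building such an automaton by hand, equivalent to "barack obama" with at most one arbitrary token in between (method names per the 4.10 sandbox API; the "body" field is an assumption):

```java
import org.apache.lucene.search.TermAutomatonQuery;

TermAutomatonQuery q = new TermAutomatonQuery("body");
int init = q.createState();
int afterBarack = q.createState();
int gap = q.createState();
int accept = q.createState();
q.setAccept(accept, true);
q.addTransition(init, afterBarack, "barack");
q.addTransition(afterBarack, accept, "obama"); // exact "barack obama"
q.addAnyTransition(afterBarack, gap);          // or skip one arbitrary token...
q.addTransition(gap, accept, "obama");         // ..."barack ? obama"
q.finish();  // determinizes the automaton; the query is now ready to use
```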

In addition to this new query there is also a simple utility class, TokenStreamToTermAutomatonQuery, that provides lossless translation of any graph TokenStream into the equivalent TermAutomatonQuery. This is powerful because it means even arbitrary token stream graphs will be correctly represented at search time, preserving the PositionLengthAttribute that some tokenizers now set.
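Usage is nearly a one-liner; for example, assuming you already have an Analyzer whose chain may produce a token graph (e.g. query-time synonyms):

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.TermAutomatonQuery;
import org.apache.lucene.search.TokenStreamToTermAutomatonQuery;

// "analyzer" is an assumed, pre-configured Analyzer instance:
TokenStream ts = analyzer.tokenStream("body", "dns");
TermAutomatonQuery q = new TokenStreamToTermAutomatonQuery().toQuery("body", ts);
ts.close();
// q now accepts exactly the token paths the stream's graph accepts
```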

While this means you can finally correctly apply arbitrary token stream graph synonyms at query-time, because the index still does not store PositionLengthAttribute, index-time synonyms are still not fully correct. That said, it would be simple to build a TokenFilter that writes the position length into a payload, and then to extend the new TermAutomatonQuery to read from the payload and apply that length during matching (patches welcome!).

The query is likely quite slow, because it assumes every term is optional; in many cases it would be easy to determine required terms (e.g. Obama in the above example) and optimize such cases. In the case where the query was derived from a token stream, so that it has no cycles and no any (*) transitions, it may be faster to enumerate all phrases accepted by the automaton (Lucene already has the getFiniteStrings API to do this for any automaton) and construct a boolean query from those phrase queries. This would match the same set of documents, also correctly preserving PositionLengthAttribute, but would assign different scores.

The code is very new and there are surely some exciting bugs! But it should be a nice start for any application that needs precise control over where terms occur inside documents.


  1. Very nice article. I had a small question: is it correct to use an automaton to create something like trending phrases, for example: Malaysia airline, Malaysian airline crash, Malaysian airline crash blackbox, etc., or is the use case totally misconstrued??


    Ravi Kiran Bhaskar

    1. You could use an automaton to hold trending phrases...

  2. Mike, this is fantastic! Any plans to add multitermqueries and/or SpanNot? Apologies if I missed these...I've only had a chance to skim the test cases. Thank you, again!

    1. I don't have any plans ... patches welcome! In theory MTQ could just rewrite to a set of transitions from one state to another, one per term that the MTQ enumerated. Doing something like SpanNot would be trickier ... it may require a separate "special" label (like "any") that matches all but certain specified terms?

  3. It's great to hear that this is coming in Lucene 4.10!
    Your post `Lucene's TokenStreams are actually graphs` encouraged us at JusBrasil to start working on a custom TokenFilter/QueryParser to improve Lucene's WordDelimiterFilter.
    We deployed it recently, and hope to open source it soon, but we are using it as an Elasticsearch plugin, so we still need to look for the best way to decouple and release it.

    We don't actually use a graph of tokens; we only label them into groups and make a boolean query from the phrase query of each group.
    This may make it not generic enough to replace the default token filter, but it was a nice improvement for us, because most of the data we index/query doesn't have a formal pattern, or the users aren't aware of it (e.g. '2014.09-PE', '201409 PE', '201409/PE').

    Well, it's awesome to hear that Lucene is improving in this area as well. Cheers

  4. This is a great tool, especially in the way that it is flexible.
    I use it to combine fuzzy and wildcard queries
    (for example, a query on "levy" should also return "levinshtein").

    However, the scoring of documents is lost. I want the top results to be those which are more similar to the query, but I get all scores as 1.0.
    Any ideas?

    1. The scoring of this query is "like" a PhraseQuery in that it counts how many matches occurred in the document and uses that as the "term freq" for scoring. So I'm not sure why you see always 1.0 score ... can you make up a test case and post to the Lucene user's list?