Monday, September 13, 2010

Fast search filters using flex

A filter in Lucene is a bit set that restricts the search space for any query; you pass it into IndexSearcher's search method. It's effective for a number of use cases, such as document security, index partitions, facet drill-down, etc.

To apply a filter, Lucene must compute the intersection of the documents matching the query against the documents allowed by the filter. Today, we do that in IndexSearcher like this:

while (true) {
  if (scorerDoc == filterDoc) {
    // Check if scorer has exhausted, only before collecting.
    if (scorerDoc == DocIdSetIterator.NO_MORE_DOCS) {
      break;
    }
    collector.collect(scorerDoc);
    filterDoc = filterIter.nextDoc();
    scorerDoc = scorer.advance(filterDoc);
  } else if (scorerDoc > filterDoc) {
    filterDoc = filterIter.advance(scorerDoc);
  } else {
    scorerDoc = scorer.advance(filterDoc);
  }
}

We call this the "leapfrog approach": the query and the filter take turns trying to advance to each other's next matching document, often jumping past the target document. When both land on the same document, it's collected.
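The loop above can be sketched as a self-contained toy model, with sorted int arrays standing in for the Scorer and DocIdSetIterator (this is illustrative code, not Lucene's; the ArrayIterator class and intersect method are made up for the demo):

```java
import java.util.ArrayList;
import java.util.List;

public class LeapfrogDemo {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  // Toy stand-in for DocIdSetIterator: iterates a sorted array of doc ids.
  static class ArrayIterator {
    final int[] docs;
    int pos = -1;

    ArrayIterator(int[] docs) { this.docs = docs; }

    // Move to the next doc, or NO_MORE_DOCS when exhausted.
    int nextDoc() {
      pos++;
      return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    // Move to the first doc >= target.
    int advance(int target) {
      while (true) {
        int doc = nextDoc();
        if (doc >= target) return doc;  // NO_MORE_DOCS >= any target
      }
    }
  }

  // The leapfrog intersection, mirroring IndexSearcher's loop above.
  static List<Integer> intersect(ArrayIterator scorer, ArrayIterator filterIter) {
    List<Integer> hits = new ArrayList<>();
    int filterDoc = filterIter.nextDoc();
    int scorerDoc = scorer.advance(filterDoc);
    while (true) {
      if (scorerDoc == filterDoc) {
        if (scorerDoc == NO_MORE_DOCS) break;
        hits.add(scorerDoc);  // "collect" the matching doc
        filterDoc = filterIter.nextDoc();
        scorerDoc = scorer.advance(filterDoc);
      } else if (scorerDoc > filterDoc) {
        filterDoc = filterIter.advance(scorerDoc);
      } else {
        scorerDoc = scorer.advance(filterDoc);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    ArrayIterator scorer = new ArrayIterator(new int[]{1, 3, 5, 8, 13, 21});
    ArrayIterator filter = new ArrayIterator(new int[]{2, 3, 8, 9, 21});
    System.out.println(intersect(scorer, filter));  // [3, 8, 21]
  }
}
```

Each side repeatedly advances to the other's current doc; only docs where both land (3, 8, 21 here) are collected.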

Unfortunately, this implementation is inefficient for several reasons (spelled out in more detail in LUCENE-1536):
  1. The advance method for most queries is costly.
  2. The advance method for most filters is usually cheap.
  3. If the number of documents matching the query is far higher than
    the number matching the filter, or vice versa, it's better to drive
    the matching by whichever is more restrictive.
  4. If the filter supports fast random access, and is not super
    sparse, it's better to apply it during postings enumeration, like
    deleted docs.
  5. Query scorers don't have a random access API, only .advance(),
    which does unnecessary extra work .next()'ing to the next matching
    document.
To fix this correctly, Lucene really needs an optimization stage, much like a modern database, which looks at the type and structure of the query and filter, as well as estimates of how many documents will match, and then picks an appropriate execution path. Likely this will be coupled with source code specialization/generation (LUCENE-1594) to write dedicated Java code to execute the chosen path. Some day we'll get to that point!

Until then, there is a simple way to get a large speedup in many cases, addressing the 4th issue above. Prior to flexible indexing, when you obtained the postings enumeration for documents matching a given term, Lucene would silently filter out deleted documents. With flexible indexing, the API now allows you to pass in a bit set marking the documents to skip. Normally you'd pass in the IndexReader's deleted docs. But, with a simple subclass of FilterIndexReader, it's possible to use any filter as the documents to skip.

To test this, I created a simple class, CachedFilterIndexReader (I'll attach it to LUCENE-1536). You pass it an existing IndexReader, plus a Filter, and it creates an IndexReader that filters out both deleted documents and documents that don't match the provided filter. Basically, it compiles the IndexReader's deleted docs (if any), and the negation of the incoming filter, into a single cached bit set, and then passes that bit set as the skipDocs whenever postings are requested. You can then create an IndexSearcher from this reader, and all searches against it will be filtered according to the filter you provided.
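The core bit-set compilation can be sketched in plain Java, using java.util.BitSet in place of Lucene's Bits/OpenBitSet (the method name compileSkipDocs is made up for this sketch; it is not the actual CachedFilterIndexReader code):

```java
import java.util.BitSet;

public class SkipDocsDemo {
  /**
   * Conceptually what CachedFilterIndexReader builds: a single skip-docs
   * bit set marking every doc that is either deleted or rejected by the
   * filter. That set is then handed to the postings enumeration.
   */
  static BitSet compileSkipDocs(BitSet deletedDocs, BitSet filterAccepts, int maxDoc) {
    // Negate the filter: a doc the filter does NOT accept must be skipped...
    BitSet skipDocs = (BitSet) filterAccepts.clone();
    skipDocs.flip(0, maxDoc);
    // ...and so must every deleted doc.
    skipDocs.or(deletedDocs);
    return skipDocs;
  }

  public static void main(String[] args) {
    int maxDoc = 8;
    BitSet deleted = new BitSet(maxDoc);
    deleted.set(2);
    BitSet accepts = new BitSet(maxDoc);
    accepts.set(0); accepts.set(1); accepts.set(2); accepts.set(5);
    // Docs 0, 1, 5 remain visible; 2 is deleted, the rest are filtered out.
    System.out.println(compileSkipDocs(deleted, accepts, maxDoc));
  }
}
```

Because the combined set is computed once and cached, every subsequent postings enumeration pays only the same per-doc bit check it already pays for deleted docs.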

This is just a prototype, and has certain limitations, e.g. it doesn't implement reopen, it's slow to build up its cached filter, etc.

Still, it works very well! I tested it on a 10M Wikipedia index, with a random filter accepting 50% of the documents:

Query             QPS Default  QPS Flex  % change
united~0.7              19.95     19.25     -3.5%
un*d                    43.19     52.21     20.9%
unit*                   21.53     30.52     41.8%
"united states"          6.12      8.74     42.9%
+united +states          9.68     14.23     47.0%
united states            7.71     14.56     88.9%
states                  15.73     36.05    129.2%

I'm not sure why the fuzzy query got a bit slower, but the speedups on the other queries are awesome. However, this approach is actually slower if the filter is very sparse. To test this, I ran just the TermQuery ("states") against different filter densities.

The cutover, for TermQuery at least, is somewhere around 1.1%, meaning if the filter accepts more than 1.1% of the index, it's best to use the CachedFilterIndexReader class; otherwise it's best to use Lucene's current implementation.
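That decision rule is trivial to express in code. Here is a hypothetical helper (the class, method name, and the idea of choosing automatically are mine, not Lucene's; only the ~1.1% threshold comes from the measurement above, and it was measured for TermQuery only):

```java
public class FilterStrategy {
  // Empirical cutover from the TermQuery test above: around 1.1% density.
  static final double DENSITY_CUTOVER = 0.011;

  /**
   * Returns true if the random-access approach (CachedFilterIndexReader)
   * is likely faster than Lucene's leapfrog intersection, based on how
   * many of the index's maxDoc documents the filter accepts.
   */
  static boolean useRandomAccessFilter(int filterCardinality, int maxDoc) {
    return filterCardinality > DENSITY_CUTOVER * maxDoc;
  }

  public static void main(String[] args) {
    // A filter accepting 2% of a 10M-doc index: random access wins.
    System.out.println(useRandomAccessFilter(200_000, 10_000_000));
    // A filter accepting 0.5%: stick with the leapfrog approach.
    System.out.println(useRandomAccessFilter(50_000, 10_000_000));
  }
}
```

A real optimizer would also weigh the query type, since the cutover surely differs for queries with costlier advance methods.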

Thanks to this new flex API, until we can fix Lucene to properly optimize filter and query intersection, this class gives you a viable, fully external means of massive speedups for non-sparse filters!
