Changing Bits: Fast range faceting using segment trees and the Java ASM library

In Lucene's facet module we recently added support for dynamic range faceting, to show how many hits match each of a dynamic set of ranges. For example, the Updated drill-down in the Lucene/Solr issue search application uses range facets. Another example is distance facets (< 1 km, < 2 km, etc.), where the distance is dynamically computed based on the user's current location. Price faceting might also use range facets, if the ranges cannot be established during indexing.

To implement range faceting, for each hit, we first calculate the value (the distance, the age, the price) to be aggregated, and then lookup which ranges match that value and increment its counts. Today we use a simple linear search through all ranges, which has O(N) cost, where N is the number of ranges.

But this is inefficient!

Segment trees

There are fun data structures like segment trees and interval trees with O(log(N) + M) cost per lookup, where M is the number of ranges that match the given value. I chose to explore segment trees, as Lucene only requires looking up by a single value (interval trees can also efficiently look up all ranges overlapping a provided range) and also because all the ranges are known up front (interval trees also support dynamically adding or removing ranges).

If the ranges will never overlap, you can use a simple binary search; Guava's ImmutableRangeSet takes this approach. However, Lucene's range faceting allows overlapping ranges so we can't do that.

Segment trees are simple to visualize: you "project" all ranges down on top of one another, creating a one-dimensional Venn diagram, to define the elementary intervals. This classifies the entire range of numbers into a minimal number of distinct ranges, each elementary interval, such that all points in each elementary interval always match the same set of input ranges. The lookup process is then a binary search to determine which elementary interval a point belongs to, recording the matched ranges as you recurse down the tree.

Consider these ranges; the lower number is inclusive and the upper number is exclusive:

The elementary intervals (think Venn diagram!) are:

Finally, you build a binary tree on top of the elementary ranges, and then add output range indices to both internal nodes and the leaves of that tree, necessary to prevent adversarial cases that would require too much (O(N^2)) space. During lookup, as you walk down the tree, you gather up the output ranges (indices) you encounter; for our example, each elementary range is assigned the follow range indices as outputs:

  -∞ – 0 →
   0 – 10 → 0
  10 – 15 → 1, 2
  15 – 20 → 1, 2, 3
  20 – 30 → 2, 3
  30 – 40 → 3
  40 – 50 → 3, 4
  50 – 70 → 4
  70 – ∞  →

Some ranges correspond to 1 elementary interval, while other ranges correspond to 2 or 3 or more, in general. Some, 2 in this example, may have no matching input ranges.

Looking up matched ranges

I've pushed all sources described below to new Google code project; the code is still somewhat rough and exploratory, so there are likely exciting bugs lurking, but it does seem to work: it includes (passing!) tests and simple micro-benchmarks.

I started with a basic segment tree implementation as described on the Wikipedia page, for long values, called SimpleLongRangeMultiSet; here's the recursive lookup method:

  private int lookup(Node node, long v, int[] answers, int upto) {
    if (node.outputs != null) {
      for(int range : node.outputs) {
        answers[upto++] = range;
      }
    }
    if (node.left != null) {
      if (v <= node.left.end) {
        upto = lookup(node.left, v, answers, upto);
      } else {
        upto = lookup(node.right, v, answers, upto);
      }
    }

    return upto;
  }

This worked correctly, but I realized there must be non-trivial overhead for the recursion, checking for nulls, the for loop over the output values, etc. Next, I tried switching to parallel arrays to hold the binary tree (ArrayLongRangeMultiSet), where the left child of node N is at 2*N and the right child is at 2*N+1, but this turned out to be slower.

After that I tested a code specializing implementation, first by creating dynamic Java source code from the binary tree. This eliminates the recursion and creates a single simple method that uses a series of if statements, specialized to the specific ranges, to do the binary search and record the range indices. Here's the resulting specialized code, compiled from the above ranges:

  void lookup(long v, int[] answers) {
    int upto = 0;
    if (v <= 19) {
      if (v <= 9) {
        if (v >= 0) {
          answers[upto++] = 0;
          answers[upto++] = 1;
        }
      } else {
        answers[upto++] = 1;
        answers[upto++] = 2;
        if (v >= 15) {
          answers[upto++] = 3;
        }
      }
    } else {
      if (v <= 39) {
        answers[upto++] = 3;
        if (v <= 29) {
          answers[upto++] = 2;
        }
      } else {
        if (v <= 49) {
          answers[upto++] = 3;
          answers[upto++] = 4;
        } else {
          if (v <= 69) {
            answers[upto++] = 4;
          }
        }
      }
    }
  }

Finally, using the ASM library, I compiled the tree directly to specialized Java bytecode, and this proved to be fastest (up to 2.5X faster in some cases).

As a baseline, I also added the simple linear search method, LinearLongRangeMultiSet; as long as you don't have too many ranges (say 10 or less), its performance is often better than the Java segment tree.

The implementation also allows you to specify the allowed range of input values (for example, maybe all values are >=0 in your usage), which can save an if statement or two in the lookup method.

Counting all matched ranges

While the segment tree allows you to quickly look up all matching ranges for a given value, after a nice tip from fellow Lucene committee Robert Muir, we realized Lucene's range faceting does not need to know the ranges for each value; instead, it only requires the aggregated counts for each range in the end, after seeing many values.

This leads to an optimization: compute counts for each elementary interval and then in the end, roll up those counts to get the count for each range. This will only work for single-valued fields, since for a multi-valued field you'd need to carefully never increment the same range more than once per hit.

So based on that approach, I created a new LongRangeCounter abstract base class, and the SimpleLongRangeCounter Java implementation, and also the ASM specialized version, and the results are indeed faster (~20 to 50%) than using the lookup method to count; I'll use this approach with Lucene.

Segment trees are normally always "perfectly" balanced but one final twist I explored was to use a training set of values to bias the order of the if statements. For example, if your ranges cover a tiny portion of the search space, as is the case for the Updated drill-down, then it should be faster to use a slightly unbalanced tree, by first checking if the value is less than the maximum range. However, in testing, while there are some cases where this "training" is a bit faster, often it's slower; I'm not sure why.

Lucene

I haven't folded this into Lucene yet, but I plan to; all the exploratory code lives in the segment-trees Google code project for now.

Results on the micro-benchmarks can be entirely different once the implementations are folded into a "real" search application. While ASM is a powerful way to generate specialized code, and it gives sizable performance gains at least in the micro-benchmarks, it is an added dependency and complexity for ongoing development and many more developers know Java than ASM. It may also confuse hotspot, causing deoptimizations when there are multiple implementations for an abstract base class. Furthermore, if there are many ranges, the resulting specialized bytecode can be become quite large (but, still O(N*log(N)) in size), which may cause other problems. On balance I'm not sure the sizable performance gains (on a micro-benchmark) warrant using ASM in Lucene's range faceting.

8 comments:

Shai EreraDecember 13, 2013 at 6:22 AM
I looked at RangeAccumulator (4.6.0) and I see that indeed it scans through all the ranges (I thought it breaks on the first matching range). So while it indeed supports overlapping ranges, I wonder if the accumulator was specialized to non-overlapping ranges, and breaking after the first range that matches, if we gained anything? This is also true for the interval tree right? You won't need to rollup in the end?
Loïc PrietoDecember 18, 2013 at 3:15 AM
A very inspiring article, Michael! Thank you for your time and sharing.
I'm on the process of learning Lucene as part of getting to know the tools of my trade, and I find it fascinating, the endless possibilities and applications of the technology. Articles like the one you wrote shows a glimpse of the amazing things that can be built for enterprise software, beyond the usual and boring (after a while) Spring/JEE Stack.
AnonymousJanuary 29, 2014 at 2:08 AM
Hello Michael. I am starting to use Lucene more frequently in the projects. As my use case varies from 1 point to another, I am exploring more and more of Lucene. As such, I am finding it more fascinating and have started developing a strong affinity for Lucene.

As such I want to explore the source code of Lucene, so that I can explore and contribute to the Source Code. I would like to please let me know, how do I start exploring the lucene source code. I have already downloaded the 4.6 Lucene source code. Want to know the starting point and also how I can contribute to the project...

I might sound ambitious but nonetheless, as I said, Lucene seems to be very interesting and very addictive.
AnonymousFebruary 26, 2014 at 6:53 PM
newbie to solr, is it possible to get autocomplete from field:content through terms or suggest or facet by date range. I need a autocomplete for a word by date range.

Thursday, December 12, 2013

Fast range faceting using segment trees and the Java ASM library

8 comments: