In Lucene's facet module we recently added support
for
dynamic
range faceting, to show how many hits match each of a dynamic set
of ranges. For example, the
Updated
drilldown in the
Lucene/Solr
issue search application uses range facets. Another
example is distance facets (< 1 km, < 2 km, etc.), where the distance
is dynamically computed based on the user's current location. Price
faceting might also use range facets, if the ranges cannot be
established during indexing.
To implement range faceting, for each hit, we first calculate the
value (the distance, the age, the price) to be aggregated, and then
lookup which ranges match that value and increment its counts. Today
we use a simple linear search through all ranges, which
has
O(N)
cost, where
N
is the number of ranges.
But this is inefficient!
Segment trees
There are fun data structures
like
segment
trees and
interval
trees with
O(log(N) + M)
cost per lookup, where
M
is the number of ranges that match the given value. I
chose to explore segment trees, as Lucene only requires looking up by
a single
value (interval trees can also efficiently look up
all ranges overlapping a provided
range) and also because all
the ranges are known up front (interval trees also support dynamically
adding or removing ranges).
If the ranges will never overlap, you can use a simple binary search;
Guava's
ImmutableRangeSet
takes this approach. However,
Lucene's range faceting allows overlapping ranges so we can't do
that.
Segment trees are simple to visualize: you "project" all ranges down on
top of one another, creating a onedimensional
Venn diagram,
to define the
elementary intervals. This classifies the
entire range of numbers into a minimal number of distinct ranges, each
elementary interval, such that all points in each elementary interval
always match the same set of input ranges. The lookup process is then
a binary search to determine which elementary interval a point belongs
to, recording the matched ranges as you recurse down the tree.
Consider these ranges; the lower number is inclusive and the upper
number is exclusive:
0: 0 – 10
1: 0 – 20
2: 10 – 30
3: 15 – 50
4: 40 – 70
The elementary intervals (think Venn diagram!) are:
∞ – 0
0 – 10
10 – 15
15 – 20
20 – 30
30 – 40
40 – 50
50 – 70
70 – ∞
Finally, you build a binary tree on top of the elementary ranges, and
then add output range indices to both internal nodes and the leaves of
that tree, necessary to prevent adversarial cases that would require
too much (
O(N^2)
) space. During lookup, as you walk down
the tree, you gather up the output ranges (indices) you encounter; for
our example, each elementary range is assigned the follow range
indices as outputs:
∞ – 0 →
0 – 10 → 0
10 – 15 → 1, 2
15 – 20 → 1, 2, 3
20 – 30 → 2, 3
30 – 40 → 3
40 – 50 → 3, 4
50 – 70 → 4
70 – ∞ →
Some ranges correspond to 1 elementary interval, while other ranges
correspond to 2 or 3 or more, in general. Some, 2 in this example,
may have no matching input ranges.
Looking up matched ranges
I've pushed all sources described below
to
new Google code
project; the code is still somewhat rough and exploratory, so
there are likely exciting bugs lurking, but it does seem to work: it
includes (passing!) tests and simple microbenchmarks.
I started with a basic segment tree implementation as described on
the
Wikipedia
page, for long values,
called
SimpleLongRangeMultiSet
;
here's the recursive
lookup
method:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16  private int lookup(Node node, long v, int[] answers, int upto) {
if (node.outputs != null) {
for(int range : node.outputs) {
answers[upto++] = range;
}
}
if (node.left != null) {
if (v <= node.left.end) {
upto = lookup(node.left, v, answers, upto);
} else {
upto = lookup(node.right, v, answers, upto);
}
}
return upto;
}

This worked correctly, but I realized there must be nontrivial
overhead for the recursion, checking for nulls, the for loop over the
output values, etc. Next, I tried switching to parallel arrays to
hold the binary tree
(
ArrayLongRangeMultiSet),
where the left child of node N is at 2*N and the right child is at
2*N+1, but this turned out to be slower.
After that I tested a code specializing implementation, first by
creating dynamic Java source code from the binary tree. This eliminates the
recursion and creates a single simple method that uses a series of
if
statements, specialized to the specific ranges, to do the binary
search and record the range indices. Here's the resulting specialized
code, compiled from the above ranges:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33  void lookup(long v, int[] answers) {
int upto = 0;
if (v <= 19) {
if (v <= 9) {
if (v >= 0) {
answers[upto++] = 0;
answers[upto++] = 1;
}
} else {
answers[upto++] = 1;
answers[upto++] = 2;
if (v >= 15) {
answers[upto++] = 3;
}
}
} else {
if (v <= 39) {
answers[upto++] = 3;
if (v <= 29) {
answers[upto++] = 2;
}
} else {
if (v <= 49) {
answers[upto++] = 3;
answers[upto++] = 4;
} else {
if (v <= 69) {
answers[upto++] = 4;
}
}
}
}
}

Finally, using the
ASM library, I
compiled the tree directly to
specialized
Java
bytecode, and this proved to be fastest (up to 2.5X faster in some
cases).
As a baseline, I also added the simple linear
search
method,
LinearLongRangeMultiSet
;
as long as you don't have too many ranges (say 10 or less), its
performance is often better than the Java segment tree.
The implementation also allows you to specify the allowed range of
input values (for example, maybe all values are >=0 in your usage),
which can save an if statement or two in the lookup method.
Counting all matched ranges
While the segment tree allows you to quickly look up all matching
ranges for a given value, after a nice tip from fellow Lucene
committee Robert Muir, we realized Lucene's range faceting does not
need to know the ranges for each value; instead, it only requires the
aggregated counts for each range in the end, after seeing many values.
This leads to an optimization: compute counts for each elementary
interval and then in the end, roll up those counts to get the count
for each range. This will only work for singlevalued fields, since
for a multivalued field you'd need to carefully never increment the
same range more than once per hit.
So based on that approach, I created a
new
LongRangeCounter
abstract base class, and
the
SimpleLongRangeCounter
Java implementation, and also the ASM specialized version, and the
results are indeed faster (~20 to 50%) than using the lookup method to
count; I'll use this approach with Lucene.
Segment trees are normally always "perfectly" balanced but one final
twist I explored was to use a
training set of values to bias
the order of the if statements. For example, if your ranges cover a
tiny portion of the search space, as is the case for
the
Updated
drilldown, then it should be faster to use a
slightly unbalanced tree, by first checking if the value is less
than the maximum range. However, in testing, while there are some
cases where this "training" is a bit faster, often it's slower; I'm
not sure why.
Lucene
I haven't folded this into Lucene yet, but I plan to; all the
exploratory code lives in
the
segmenttrees
Google code project for now.
Results on the microbenchmarks can be entirely different once the
implementations are folded into a "real" search application.
While
ASM is a powerful way to
generate specialized code, and it gives sizable performance gains at
least in the microbenchmarks, it is an added dependency and
complexity for ongoing development and many more developers know Java than
ASM. It may also confuse hotspot,
causing deoptimizations when there are multiple implementations for an
abstract base class. Furthermore, if there are many ranges, the
resulting specialized bytecode can be become quite large (but,
still
O(N*log(N))
in size), which may cause other
problems. On balance I'm not sure the sizable performance gains
(on a microbenchmark) warrant using ASM in Lucene's range faceting.