To understand these features, and why they are important, we first need a little background. Lucene's facet module does most of its work at indexing time: for each indexed document, it examines every facet label (each of which may be hierarchical), maps each unique label in the hierarchy to an integer id, and encodes all the ids into a binary doc-values field. A separate taxonomy index stores this mapping and ensures that the same label gets the same id, even across segments.
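To make the indexing-time mapping concrete, here is a minimal sketch in plain Java. It is not the facet module's API: the class ToyTaxonomy and its methods are hypothetical, and joining path components with '/' is a simplification that assumes no component contains that character. It only illustrates the idea that every level of a hierarchical label gets its own stable integer id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a taxonomy index (hypothetical, not Lucene's API). */
final class ToyTaxonomy {
    // Maps a full path such as "Date/2013/05" to its integer id.
    private final Map<String, Integer> idsByPath = new LinkedHashMap<>();

    /**
     * Returns one id per level of the hierarchical label, assigning a new id
     * the first time a path is seen and reusing it afterwards.
     */
    List<Integer> addLabel(String... components) {
        List<Integer> ids = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        for (String component : components) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(component);
            String key = path.toString();
            Integer id = idsByPath.get(key);
            if (id == null) {
                id = idsByPath.size();
                idsByPath.put(key, id);
            }
            ids.add(id);
        }
        return ids;
    }

    /** Number of distinct paths seen so far. */
    int size() {
        return idsByPath.size();
    }

    public static void main(String[] args) {
        ToyTaxonomy taxonomy = new ToyTaxonomy();
        System.out.println(taxonomy.addLabel("Date", "2013", "05"));
        System.out.println(taxonomy.addLabel("Date", "2013", "06"));
    }
}
```

The list of ids returned for a document is what would then be encoded into that document's doc-values field.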
At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results at the end, for example as the top N facet labels by count.
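The search-time aggregation can be sketched roughly as follows. This is a hedged illustration in plain Java, not the facet module's code: OrdinalCounting, countOrdinals and topN are hypothetical names, the per-document ids are passed in as plain arrays rather than decoded from doc values, and the top N is found by sorting where a real implementation would more likely use a priority queue.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of search-time facet counting over integer ids (hypothetical names). */
final class OrdinalCounting {

    /** counts[id] is the number of matched documents that carry that id. */
    static int[] countOrdinals(int[][] idsPerMatchedDoc, int taxonomySize) {
        int[] counts = new int[taxonomySize];
        for (int[] ids : idsPerMatchedDoc) {
            for (int id : ids) {
                counts[id]++;
            }
        }
        return counts;
    }

    /**
     * Returns up to n ids with non-zero counts, highest count first;
     * ties are broken by the smaller id.
     */
    static List<Integer> topN(int[] counts, int n) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < counts.length; id++) {
            if (counts[id] > 0) {
                ids.add(id);
            }
        }
        ids.sort((a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        return new ArrayList<>(ids.subList(0, Math.min(n, ids.size())));
    }

    public static void main(String[] args) {
        // Three matched documents, each with the ids it was indexed with.
        int[][] matched = {{0, 1, 2}, {0, 1, 3}, {0, 1, 2}};
        int[] counts = countOrdinals(matched, 4);
        System.out.println(topN(counts, 2));
    }
}
```

The inner loop is just an array increment per id, which is why the per-document cost stays so small.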
This is in contrast to purely dynamic faceting implementations like ElasticSearch's and Solr's, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.
However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr's UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.
Enough background, now on to our two new features!
Sorted-set doc-values faceting
These features bring two dynamic alternatives to the facet module, both computing facet counts from previously indexed doc-values fields. The first feature, sorted-set doc-values faceting (see LUCENE-4795), allows the application to index a normal sorted-set doc-values field, for example:
    doc.add(new SortedSetDocValuesField("foo"));
    doc.add(new SortedSetDocValuesField("bar"));

and then to compute facet counts at search time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
This feature does not use the taxonomy index, since all state is stored in the doc-values, but the tradeoff is that on each near-real-time reopen, a top-level data-structure is recomputed to map per-segment integer ordinals to global ordinals. The good news is this should be relatively low cost since it's just merge-sorting already sorted terms, and it doesn't need to visit the documents (unlike UnInvertedField).
At search time there is also a small performance hit (~25%, depending on the query) since each per-segment ord must be re-mapped to the global ord space. Likely this could be improved (no time was spent optimizing). Furthermore, this feature currently only works with non-hierarchical facet fields, though this should be fixable (patches welcome!).
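To illustrate the ordinal mapping described above, here is a rough sketch in plain Java. It is not the code behind SortedSetDocValuesReaderState: GlobalOrdSketch and its methods are hypothetical, terms are plain strings compared with String.compareTo rather than byte order, and each segment's terms are assumed to be already sorted and unique. It shows the two halves of the tradeoff: a merge of sorted term lists on reopen, and one extra array lookup per ord at search time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of mapping per-segment ords to global ords (hypothetical names). */
final class GlobalOrdSketch {

    /**
     * Merges the sorted, unique term list of every segment and returns
     * ordMaps where ordMaps[seg][segOrd] is the global ord of that term.
     * The merged, de-duplicated terms are appended to globalTerms.
     */
    static int[][] buildOrdMaps(String[][] segTerms, List<String> globalTerms) {
        int[][] ordMaps = new int[segTerms.length][];
        // Queue entries are {segment, segOrd}, ordered by the term they point at.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
                (a, b) -> segTerms[a[0]][a[1]].compareTo(segTerms[b[0]][b[1]]));
        for (int seg = 0; seg < segTerms.length; seg++) {
            ordMaps[seg] = new int[segTerms[seg].length];
            if (segTerms[seg].length > 0) {
                queue.add(new int[] {seg, 0});
            }
        }
        while (!queue.isEmpty()) {
            int[] top = queue.poll();
            int seg = top[0];
            int segOrd = top[1];
            String term = segTerms[seg][segOrd];
            // A term shared by several segments gets a single global ord.
            if (globalTerms.isEmpty()
                    || !globalTerms.get(globalTerms.size() - 1).equals(term)) {
                globalTerms.add(term);
            }
            ordMaps[seg][segOrd] = globalTerms.size() - 1;
            if (segOrd + 1 < segTerms[seg].length) {
                queue.add(new int[] {seg, segOrd + 1});
            }
        }
        return ordMaps;
    }

    /** Counts matched docs per global ord, remapping every per-segment ord. */
    static int[] count(int[][] ordMaps, int[][][] matchedDocOrdsPerSegment,
                       int globalTermCount) {
        int[] counts = new int[globalTermCount];
        for (int seg = 0; seg < matchedDocOrdsPerSegment.length; seg++) {
            for (int[] docOrds : matchedDocOrdsPerSegment[seg]) {
                for (int segOrd : docOrds) {
                    counts[ordMaps[seg][segOrd]]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] segTerms = {{"blue", "red"}, {"green", "red", "yellow"}};
        List<String> globalTerms = new ArrayList<>();
        int[][] ordMaps = buildOrdMaps(segTerms, globalTerms);
        int[][][] matched = {{{0}, {1}}, {{1}, {0, 2}}};
        int[] counts = count(ordMaps, matched, globalTerms.size());
        for (int ord = 0; ord < counts.length; ord++) {
            System.out.println(globalTerms.get(ord) + ": " + counts[ord]);
        }
    }
}
```

Note that buildOrdMaps only touches terms, never documents, which matches the reason given above for the reopen cost being relatively low.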
Dynamic range faceting
The second new feature, dynamic range faceting, works on top of a numeric doc-values field (see LUCENE-4965), and implements dynamic faceting over numeric ranges. You create a RangeFacetRequest, providing custom ranges with their labels. Each matched document is checked against all ranges and the count is incremented when there is a match. The range-test is a naive simple linear search, which is probably OK since there are usually only a few ranges, but we could eventually upgrade this to an interval tree to get better performance (patches welcome!).
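The linear range test can be sketched in a few lines of plain Java. This is only an illustration of the technique, not the RangeFacetRequest API: RangeCounting and Range are hypothetical, and the choice of an inclusive minimum and exclusive maximum is an assumption made for the example. Ranges are allowed to overlap, so every range is tested for every value.

```java
/** Sketch of dynamic range faceting by linear search (hypothetical names). */
final class RangeCounting {

    /** A labeled range; min is inclusive and max is exclusive (an assumption). */
    static final class Range {
        final String label;
        final long min;
        final long max;

        Range(String label, long min, long max) {
            this.label = label;
            this.min = min;
            this.max = max;
        }

        boolean accept(long value) {
            return value >= min && value < max;
        }
    }

    /**
     * For each matched document's value, tests every range in turn and
     * increments the count of each range that contains the value.
     */
    static int[] count(Range[] ranges, long[] matchedValues) {
        int[] counts = new int[ranges.length];
        for (long value : matchedValues) {
            for (int i = 0; i < ranges.length; i++) {
                if (ranges[i].accept(value)) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Range[] ranges = {
            new Range("under 10", 0, 10),
            new Range("under 100", 0, 100),
            new Range("50 to 100", 50, 100),
        };
        int[] counts = count(ranges, new long[] {5, 60, 99, 100});
        for (int i = 0; i < ranges.length; i++) {
            System.out.println(ranges[i].label + ": " + counts[i]);
        }
    }
}
```

The cost is proportional to the number of matched documents times the number of ranges, which is why a handful of ranges is fine and an interval tree would only pay off with many.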
Likewise, this new feature does not use the taxonomy index, only a numeric doc-values field. This feature is especially useful with time-based fields. You can see it in action in the Jira issues search example in the Updated field.
Happy faceting!