Tuesday, May 21, 2013

Dynamic faceting with Lucene

Lucene's facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I'll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene's facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.
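
For example, the index-time side looks roughly like this with the 4.x facet APIs (the dimensions and values here are made up, taxoDir and indexWriter are assumed to already exist, and exact signatures shift a little from release to release):

  // The taxonomy index lives in its own directory, alongside the main index
  DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
  FacetFields facetFields = new FacetFields(taxoWriter);

  Document doc = new Document();
  // ... add the document's normal fields ...

  // Each CategoryPath is one facet label; labels may be hierarchical
  facetFields.addFields(doc, Arrays.asList(
      new CategoryPath("Author", "Lisa"),
      new CategoryPath("Date", "2013", "May")));
  indexWriter.addDocument(doc);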

At search time, faceting cost is minimal: for each matched document, we visit all of its integer ids and aggregate counts in an array, summarizing the results at the end, for example as the top N facet labels by count.
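
Roughly, the search-time side then looks like this (searcher, query and taxoReader are assumed to already exist; again, exact signatures vary a little across 4.x releases):

  // Ask for the top 10 "Author" labels, counted over the matched documents
  FacetSearchParams fsp = new FacetSearchParams(
      new CountFacetRequest(new CategoryPath("Author"), 10));
  FacetsCollector fc = FacetsCollector.create(fsp, searcher.getIndexReader(), taxoReader);
  searcher.search(query, fc);
  List<FacetResult> results = fc.getFacetResults();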

This is in contrast to purely dynamic faceting implementations like ElasticSearch's and Solr's, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, near-real-time reopen latency can suffer badly if top-level data-structures, such as Solr's UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

Enough background, now on to our two new features!


Sorted-set doc-values faceting

The two new features bring dynamic alternatives to the facet module, both computing facet counts from previously indexed doc-values fields. The first, sorted-set doc-values faceting (see LUCENE-4795), allows the application to index a normal sorted-set doc-values field, for example:
  doc.add(new SortedSetDocValuesField("field", new BytesRef("foo")));
  doc.add(new SortedSetDocValuesField("field", new BytesRef("bar")));
and then to compute facet counts at search time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
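
Roughly, that looks like this at search time (a sketch: the exact constructors may differ a bit in your release, and how the requested dimension maps onto the doc-values field depends on your facet indexing params):

  // Built once per (near-real-time) reader reopen; this computes the
  // per-segment ord -> global ord mapping described in the next paragraph
  SortedSetDocValuesReaderState state =
      new SortedSetDocValuesReaderState(searcher.getIndexReader());

  // Count the top values of the example field indexed above
  FacetSearchParams fsp = new FacetSearchParams(
      new CountFacetRequest(new CategoryPath("field"), 10));
  FacetsCollector fc = FacetsCollector.create(
      new SortedSetDocValuesAccumulator(state, fsp));
  searcher.search(query, fc);
  List<FacetResult> results = fc.getFacetResults();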

This feature does not use the taxonomy index, since all state is stored in the doc-values, but the tradeoff is that on each near-real-time reopen, a top-level data-structure is recomputed to map per-segment integer ordinals to global ordinals. The good news is this should be relatively low cost since it's just merge-sorting already sorted terms, and it doesn't need to visit the documents (unlike UnInvertedField).

At search time there is also a small performance hit (~25%, depending on the query) since each per-segment ord must be re-mapped to the global ord space. This could likely be improved (no time was spent optimizing it). Furthermore, this feature currently only works with non-hierarchical facet fields, though this should be fixable (patches welcome!).


Dynamic range faceting

The second new feature, dynamic range faceting (see LUCENE-4965), works on top of a numeric doc-values field and implements dynamic faceting over numeric ranges. You create a RangeFacetRequest, providing custom ranges with their labels. Each matched document is checked against every range and the count is incremented when there is a match. The range test is a simple linear scan over all ranges, which is probably OK since there are usually only a few ranges, but we could eventually upgrade it to an interval tree for better performance (patches welcome!).
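
For example, here's a rough sketch faceting on an indexed timestamp field (the field name, the ranges and the surrounding setup are just illustrative):

  // Index-time: each document gets a numeric doc-values field
  doc.add(new NumericDocValuesField("updated", timestampSec));

  // Search-time: count matched documents into custom, labeled ranges
  // (ranges may overlap; each document is tested against all of them)
  RangeFacetRequest<LongRange> request = new RangeFacetRequest<LongRange>("updated",
      new LongRange("Past hour", nowSec - 3600, true, nowSec, true),
      new LongRange("Past day", nowSec - 86400, true, nowSec, true),
      new LongRange("Past week", nowSec - 7 * 86400, true, nowSec, true));
  FacetsCollector fc = FacetsCollector.create(new RangeAccumulator(request));
  searcher.search(query, fc);
  List<FacetResult> results = fc.getFacetResults();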

Likewise, this new feature does not use the taxonomy index, only a numeric doc-values field. It is especially useful with time-based fields; you can see it in action on the Updated field in the Jira issues search example.

Happy faceting!

13 comments:

  1. Mike, this is great.

    Is there any way to tell when the 4.4 release will take place?

  2. Hi,

    Well, it's open-source, so there's no hard guarantee :) It ships when it ships ... but, generally we've been doing .x releases every few months.

    You can also easily build the Lucene JARs off a 4.x branch checkout.

  3. Hi, Michael, how are you doing?

    My name is Tony.

    I am particularly interested in the dynamic range faceting feature.
    I am wondering, do you know of any resources that have examples of how to use this feature?

    Thank you very much!

    -tony

  4. Hi Tungnian,

    Have a look at the unit test: https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeAccumulator.java

  5. Hi Mike,

    We have a requirement where we need to display facet for price range.

    Firstly let me explain our data structure.
    We have products, each product can be present in multiple stores, and each store has its own price.
    The requirement is that when a customer is logged in, the price displayed should come from their selected store; the challenging part is how to show the price range facet according to the customer's selected store on the search results page.

    Let me give an example: we have Product-P under Store-A and Store-B, with prices $8 and $12 respectively.
    If user-1 selects Store-B as their preferred store then Product-P will have price $12, and likewise for Store-A the price will be $8.
    So the product should be shown under the $0-10 price range if the user selects Store-A, and under the $10-20 price range for Store-B.

    So what is the best approach to index the data into the Solr search engine so that the price range facet can be handled effectively?

    One approach we can think of is to index each individual store's price for each product as a separate attribute.
    If a product is present in 40 stores then the number of attributes would be around 55 (including basic attributes), which I think is huge.
    So please provide your valuable suggestions and thoughts.

    Thanks,
    Sreehareesh

  6. Hi Sreehareesh,

    Could you email solr-user@lucene.apache.org with this question?

  7. Michael,

    WDYT about hacking the SortedSetDV codec to dedup text values, map them to global ords, and write the global ords to a file?

    Replies
    1. Hi Mikhail,

      The challenge there is that the Codec API is strongly per-segment.

      This is what the taxonomy index does in Lucene's facet module, so that at search time we do all counting in the global ord space.

      For SortedSetDVFacetCounts, we create the OrdinalMap which computes the local -> global ord mapping for each segment.

    2. 2. is clear, but you know.
      3. Is it a cool thing, or is it worth moving to index time, and we just need to find out exactly how to do so?
      1. I understand that it's per-segment, but it's not clear why we can't look at previous segments? E.g.:
      - segment 1 has terms A, B, C; we assign ords 1, 2, 3
      - then we index segment 2 with terms B, D
      - at first we assign local ords 1, 2, but
      - in a second pass we can remap them to global ords: B->2, D->4
      - then segment 1 has terms (dictionary) A, B, C, and segment 2 has only the term D, with assigned global ord 4
      - the only additional data structure we need at query time is a max(global-ord) -> segment map, so we can find the segment holding a particular term when doing 'labeling' (mapping a global ord back to its text). This data structure is compact and incremental.
      - we've definitely lost sortability, and merging will feel some pain. However, this use case seems useful to me. Am I missing something?

  8. I think something along these lines should be possible :) It's just software!

    On seeing a new segment, in the worst case, you'd need to go and rewrite all prior segments, right? (E.g. a new segment may insert a new label as globalOrd=0, and bump all past globalOrds by 1).

    The taxo index avoids this because the ords are not actually ords; they are "term ids", assigned first come, first served, so new labels won't change any previously assigned ords. One nice property of this approach is that common labels tend to get smaller ords, which means less storage, since we use delta-vInt encoding to write the ords for each document.

    Replies
    1. Hmm, I see my example was not representative enough. My idea is that we can give up the ordering of ordinals.
      segment #1
      B-> 1
      D-> 2
      E-> 3

      segment #2
      pass 1 local ord
      A->1
      B->2
      C->3
      D->4
      F->5
      pass 2 dedup, new ords start from max ords@seg#1 +1
      A->4 - inc
      B->1 - dedupe, omit
      C->5 - inc
      D->2 - dedupe, omit
      F->6 - inc
      segment#2 term-ords
      A->4
      C->5
      F->6
      documents with B, D, E obtain global ords 1..3 from the segment #1 mapping. We can't sort by these ords, but faceting should be really NRT, and cheap on memory.

      The idea of the taxo index is clear to me.

    2. OK, now I see. So you want to assign term IDs, not ords (just like the taxo index does), but instead of storing the mapping separately you want to do it as a clever DVFormat, which would be awesome.

      So the DVFormat would somehow "see" all existing segments' mappings (I guess load & union them on init), and then only write any new labels to the current segment.

      Why would merging be hard? It seems straightforward? Unless you want to try to "garbage collect" any now-unused term IDs, which would indeed be hard; I wouldn't try to do that on the first go.
