Tuesday, May 21, 2013

Dynamic faceting with Lucene

Lucene's facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I'll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene's facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.
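
For example, the index-time side looks roughly like this with the 4.x facet APIs (the dimensions and values here are made up, taxoDir and indexWriter are assumed to already exist, and exact signatures shift a little from release to release):

  // The taxonomy index lives in its own directory, alongside the main index
  DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
  FacetFields facetFields = new FacetFields(taxoWriter);

  Document doc = new Document();
  // ... add the document's normal fields ...

  // Each CategoryPath is one facet label; labels may be hierarchical
  facetFields.addFields(doc, Arrays.asList(
      new CategoryPath("Author", "Lisa"),
      new CategoryPath("Date", "2013", "May")));
  indexWriter.addDocument(doc);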

At search time, faceting cost is minimal: for each matched document, we visit all of its integer ids and aggregate counts in an array, summarizing the results at the end, for example as the top N facet labels by count.
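
Roughly, the search-time side then looks like this (searcher, query and taxoReader are assumed to already exist; again, exact signatures vary a little across 4.x releases):

  // Ask for the top 10 "Author" labels, counted over the matched documents
  FacetSearchParams fsp = new FacetSearchParams(
      new CountFacetRequest(new CategoryPath("Author"), 10));
  FacetsCollector fc = FacetsCollector.create(fsp, searcher.getIndexReader(), taxoReader);
  searcher.search(query, fc);
  List<FacetResult> results = fc.getFacetResults();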

This is in contrast to purely dynamic faceting implementations like ElasticSearch's and Solr's, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, near-real-time reopen latency can suffer badly if top-level data-structures, such as Solr's UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

Enough background, now on to our two new features!


Sorted-set doc-values faceting

The two new features bring dynamic alternatives to the facet module, both computing facet counts from previously indexed doc-values fields. The first, sorted-set doc-values faceting (see LUCENE-4795), allows the application to index a normal sorted-set doc-values field, for example:
  doc.add(new SortedSetDocValuesField("field", new BytesRef("foo")));
  doc.add(new SortedSetDocValuesField("field", new BytesRef("bar")));
and then to compute facet counts at search time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.
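
Roughly, that looks like this at search time (a sketch: the exact constructors may differ a bit in your release, and how the requested dimension maps onto the doc-values field depends on your facet indexing params):

  // Built once per (near-real-time) reader reopen; this computes the
  // per-segment ord -> global ord mapping described in the next paragraph
  SortedSetDocValuesReaderState state =
      new SortedSetDocValuesReaderState(searcher.getIndexReader());

  // Count the top values of the example field indexed above
  FacetSearchParams fsp = new FacetSearchParams(
      new CountFacetRequest(new CategoryPath("field"), 10));
  FacetsCollector fc = FacetsCollector.create(
      new SortedSetDocValuesAccumulator(state, fsp));
  searcher.search(query, fc);
  List<FacetResult> results = fc.getFacetResults();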

This feature does not use the taxonomy index, since all state is stored in the doc-values, but the tradeoff is that on each near-real-time reopen, a top-level data-structure is recomputed to map per-segment integer ordinals to global ordinals. The good news is this should be relatively low cost since it's just merge-sorting already sorted terms, and it doesn't need to visit the documents (unlike UnInvertedField).

At search time there is also a small performance hit (~25%, depending on the query) since each per-segment ord must be re-mapped to the global ord space. This could likely be improved (no time was spent optimizing it). Furthermore, this feature currently only works with non-hierarchical facet fields, though this should be fixable (patches welcome!).


Dynamic range faceting

The second new feature, dynamic range faceting (see LUCENE-4965), works on top of a numeric doc-values field and implements dynamic faceting over numeric ranges. You create a RangeFacetRequest, providing custom ranges with their labels. Each matched document is checked against every range and the count is incremented when there is a match. The range test is a simple linear scan over all ranges, which is probably OK since there are usually only a few ranges, but we could eventually upgrade it to an interval tree for better performance (patches welcome!).
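
For example, here's a rough sketch faceting on an indexed timestamp field (the field name, the ranges and the surrounding setup are just illustrative):

  // Index-time: each document gets a numeric doc-values field
  doc.add(new NumericDocValuesField("updated", timestampSec));

  // Search-time: count matched documents into custom, labeled ranges
  // (ranges may overlap; each document is tested against all of them)
  RangeFacetRequest<LongRange> request = new RangeFacetRequest<LongRange>("updated",
      new LongRange("Past hour", nowSec - 3600, true, nowSec, true),
      new LongRange("Past day", nowSec - 86400, true, nowSec, true),
      new LongRange("Past week", nowSec - 7 * 86400, true, nowSec, true));
  FacetsCollector fc = FacetsCollector.create(new RangeAccumulator(request));
  searcher.search(query, fc);
  List<FacetResult> results = fc.getFacetResults();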

Likewise, this new feature does not use the taxonomy index, only a numeric doc-values field. It is especially useful with time-based fields; you can see it in action on the Updated field in the Jira issues search example.

Happy faceting!

13 comments:

  1. Mike, this is great.

    Is there any way to tell when the 4.4 release will take place?

  2. Hi,

    Well, it's open-source, so there's no hard guarantee :) It ships when it ships ... but, generally we've been doing .x releases every few months.

    You can also easily build the Lucene JARs off a 4.x branch checkout.

  3. Hi, Michael, how are you doing?

    My name is Tony.

    I am particularly interested in the dynamic range faceting feature.
    I am wondering, do you know of any resources that have examples of how to use this feature?

    Thank you very much!

    -tony

  4. Hi Tungnian,

    Have a look at the unit test: https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/facet/src/test/org/apache/lucene/facet/range/TestRangeAccumulator.java

  5. Hi Mike,

    We have a requirement where we need to display facet for price range.

    Firstly let me explain our data structure.
    We have products, each product can be present in multiple stores, and each store has its own price.
    The requirement is that when a customer is logged in, the price displayed should come from their selected store; the challenging part is how to show the price range facet according to the customer's selected store on the search results page.

    Let me give an example: we have Product-P under Store-A and Store-B, with prices $8 and $12 respectively.
    If user-1 selects Store-B as their preferred store then Product-P will have price $12, and likewise for Store-A the price will be $8.
    So the product should be shown under the $0-10 price range if the user selects Store-A, and under the $10-20 price range for Store-B.

    So what is the best approach to index the data into the Solr search engine so that the price range facet can be handled effectively?

    One approach we can think of is to index each individual store's price for each product as a separate attribute.
    If a product is present in 40 stores then the number of attributes would be around 55 (including basic attributes), which I think is huge.
    So please provide your valuable suggestions and thoughts.

    Thanks,
    Sreehareesh

  6. Hi Sreehareesh,

    Could you email solr-user@lucene.apache.org with this question?

  7. Michael,

    WDYT about hacking the SortedSetDV codec to dedup text values, map them to global ords, and write the global ords to a file?

    Replies
    1. Hi Mikhail,

      The challenge there is that the Codec API is strongly per-segment.

      This is what the taxonomy index does in Lucene's facet module, so that at search time we do all counting in the global ord space.

      For SortedSetDVFacetCounts, we create the OrdinalMap which computes the local -> global ord mapping for each segment.

    2. 2. is clear, but you know.
      3. Is it a cool thing, or is it worth moving to index time, and we just need to find out exactly how to do so?
      1. I understand that it's per-segment, but it's not clear why we can't look at previous segments? E.g.:
      - segment 1 has terms A, B, C; we assign ords 1, 2, 3
      - then we index segment 2 with terms B, D
      - at first we assign local ords 1, 2, but
      - in a second pass we can remap them to global ords: B->2, D->4
      - then segment 1 has terms (dictionary) A, B, C, and segment 2 has only the term D, with assigned global ord 4
      - the only additional data structure we need at query time is a max(global-ord) -> segment map, so we can find the segment holding a particular term when doing 'labeling' (mapping a global ord back to its text). This data structure is compact and incremental.
      - we've definitely lost sortability, and merging will feel some pain. However, this use case seems useful to me. Am I missing something?

  8. I think something along these lines should be possible :) It's just software!

    On seeing a new segment, in the worst case, you'd need to go and rewrite all prior segments, right? (E.g. a new segment may insert a new label as globalOrd=0, and bump all past globalOrds by 1).

    The taxo index avoids this because the ords are not actually ords; they are "term ids", assigned first come, first served, so new labels won't change any previously assigned ords. One nice property of this approach is that common labels tend to get smaller ords, which means less storage, since we use delta-vInt encoding to write the ords for each document.

    Replies
    1. Hmm, I see my example was not representative enough. My idea is that we can give up the ordering of ordinals.
      segment #1
      B-> 1
      D-> 2
      E-> 3

      segment #2
      pass 1 local ord
      A->1
      B->2
      C->3
      D->4
      F->5
      pass 2 dedup, new ords start from max ords@seg#1 +1
      A->4 - inc
      B->1 - dedupe, omit
      C->5 - inc
      D->2 - dedupe, omit
      F->6 - inc
      segment#2 term-ords
      A->4
      C->5
      F->6
      documents with B, D, E obtain global ords 1..3 from the segment #1 mapping. We can't sort by these ords, but faceting should be really NRT, and cheap on memory.

      The idea of the taxo index is clear to me.

    2. OK, now I see. So you want to assign term IDs, not ords (just like the taxo index does), but instead of storing the mapping separately you want to do it as a clever DVFormat, which would be awesome.

      So the DVFormat would somehow "see" all existing segments' mappings (I guess load & union them on init), and then only write any new labels to the current segment.

      Why would merging be hard? It seems straightforward? Unless you want to try to "garbage collect" any now-unused term IDs, which would indeed be hard; I wouldn't try to do that on the first go.
