Sunday, February 24, 2013

Drill Sideways faceting with Lucene

Lucene's facet module, as I described previously, provides a powerful implementation of faceted search for Lucene. There's been a lot of progress recently, including awesome performance gains as measured by the nightly performance tests we run for Lucene:



That's a nearly 3.8X speedup!

But beyond speedups there are also new features, and here I'll describe the new DrillSideways class being developed under LUCENE-4748.

In the typical faceted search UI, you'll see a column to the left of your search results showing each field (Price, Manufacturer, etc.) enabled for refinement, displaying the unique values or ranges along with their counts. The count tells you how many matches from your current search you'll see if you drill down on that value or range, by clicking it and adding that filter to your current search. If you go shopping for Internal SSDs at Newegg you'll see something like the drill down on the left.

If you click on one of those values at Newegg, say 129GB - 256GB, you'll notice that the field then disappears from the drill down options and jumps to the top of the page, as part of the search breadcrumb, where the only thing you can do with it is drill up by clicking the (X) to remove the filter. What if you drill down on several fields, but then want to explore alternate values for those fields? You can't really do that: you're stuck. This is a poor user experience, and I often find myself making excessive use of the back button to explore different options on sites that only offer drill down and drill up: the UI is too restrictive. Overstock.com is another example of a pure drill down/up UI.

Other sites improve on this by offering drill sideways, or the option to pick alternative or additional facet values for a field you've previously drilled down on.

For example, try searching for an LED television at Amazon, and look at the Brand field, seen in the image to the right: this is a multi-select UI, allowing you to select more than one value. When you select a value (check the box or click on the value), your search is filtered as expected, but this time the field does not disappear: it stays where it was, allowing you to then drill sideways on additional values. Much better!

LinkedIn's faceted search, seen on the left, takes this even further: not only are all fields drill sideways and multi-select, but there is also a text box at the bottom for you to choose a value not shown in the top facet values.

To recap, a single-select field only allows picking one value at a time for filtering, while a multi-select field allows picking more than one. Separately, drilling down means adding a new filter to your search, reducing the number of matching docs. Drilling up means removing an existing filter from your search, expanding the matching documents. Drilling sideways means changing an existing filter in some way, for example picking a different value to filter on (in the single-select case), or adding another or'd value to filter on (in the multi-select case).

Whether the field offers drill sideways is orthogonal to whether the field is multi-select. For example, look at the date field at Search-lucene.com: it's a single-select field, but allows for drill sideways (unfortunately the counts seem to be buggy!).

Implementing Drill Sideways

How can one implement drill sideways? It's tricky because, once you've drilled into a specific value for a given field, the other values in that field will of course have zero drill down count since they were filtered out of the result set. So how do you compute those other counts?

A straightforward solution is the "hold one out" approach: if the original query is "foo", and you've drilled down on three fields (say A:a, B:b and C:c), then to compute the facet count for each of the fields A, B and C you re-run the query, computing drill down counts, each time with only the other fields filtered.

For example, the facet counts for A are the drill down counts from the query +foo +B:b +C:c (A:a is left out), while the facet counts for B are the drill down counts from the query +foo +A:a +C:c (B:b is left out), etc. You must also separately run the query with all filters (+foo +A:a +B:b +C:c) to get the actual hits. With Solr you explicitly specify this hold-on-out yourself as part of the facet request.

While this approach will produce the correct counts, it's costly because now you're running 4 queries instead of 1. You may be able to speed this up by using bitsets instead: produce a bitset of all matching documents for just the query "foo", then a bitset for each of the drill downs A, B and C, and then intersect the bitsets as above, and count facets for the resulting docs. Solr uses an approach like this, caching the bitsets across requests.

Yet another approach, and the one we took on LUCENE-4748, is to execute a single query that matches both hits as well as near-misses, which are documents that matched all but one of the filters. This is actually a standard BooleanQuery, where the original query is a MUST clause, and each filter is a SHOULD clause, with minNumberShouldMatch set to the number of drill down filters minus 1. Then, when each match is collected, if all filters match, it's a true hit and we collect the document accordingly and increment the drill down counts. But if it's a near-miss, because exactly one filter failed, then we instead increment drill sideways counts against that one field that didn't match.

Note that when you first drill down on a given field, the resulting drill sideways counts for that field will be the same as the drill down counts just before you drilled down. If the UI can save this state then a small optimization would be to avoid counting the drill sideways counts for the just-drilled-down field, and instead reuse the previous drill down counts.

The DrillSideways class is obviously very new, so I'm sure there are some exciting bugs in it (though it does have a good randomized test). Feedback and patches are welcome!

14 comments:

  1. The implementation approach of using minNumberShouldMatch is clever. Any idea if it seems better/faster than caching bit sets? My gut opinion is that bitsets will be faster, albeit require the memory (of course).

    ReplyDelete
  2. Hi David,

    I'm also curious about the performance tradeoffs, but I haven't tested. It'd be nice to separately test bitsets approach without caching, and bitsets approach with caching.

    ReplyDelete
  3. Hi Mike,

    I am building a site for my client on IBM WebSphere Commerce. WebSphere Commerce currently uses Apache Solr/Lucene 3.5 version... therefore we won't get the Drill Sideways faceting with Lucene in it.

    Is it possible that I can customize the WebSphere Commerce search to also include the Drill Sideways?
    OR
    Can you tell us what things do we need to change in order to get Solr 4.4 Drill Sideways functionality in Solr 3.5?

    Many Thanks
    Sid.

    ReplyDelete
  4. Hi Sid,

    Solr has its own faceting implementation, separate from Lucene's faceting module, and it supports something equivalent (I think?) to drill sideways, which is called multi-select faceting. You'll need to pass filters for each dimension, and then exclude each dimensions filter when computing facet counts for it. See http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams for details.

    Also, 3.5 is quite old at this point ... best to upgrade :) 4.5 is right around the corner.

    ReplyDelete
  5. Hi Micheal,
    Really appreciate if you could add me on my Skype.. I would have to get a good approach on doing so from you.
    My Skype id is rc.shaikh.imran
    Meanwhile I will continue my readings on the above said wiki.


    Many Thanks

    ReplyDelete
  6. Michael,
    I wondered is there more B2C cases for single selection drillsideway facets? jirasearch and search-lucene.com don't seem like "real life examples".

    ReplyDelete
    Replies
    1. Hmm how about amazon.com? Search for cameras, drill-down e.g. on resolution and you still see the other resolutions (with their counts), and the counts update correctly even if you drill down on another dimension like Brand.

      Now if only amazon.com would set the cursor focus into their search box when you load their page in the browser...

      Delete
    2. Absolutely. You help me so much once again! Drill sideways forever!

      Delete
  7. Hi Mike i wonder how to make the FacetField increment if it see the same value again? Sketch follows:

    doc.add(FecetField( "Something", "Some value") .. so the point is if ti see the same "Some value" in the same doc again i want to increment facet count (to be Something - Some value(2) for that doc). can i achieve that? from some exceptions i saw something like .isseen or something so i assume it won`t allow duplicate for that doc but still...?

    ReplyDelete
    Replies
    1. Hi Anonymous,

      That's not very easy: all of the facet code effectively counts something like count(distinct(docID)) per facet value ...

      Maybe you could do something hacky, like if the same value appears more than once per document, attach a unique suffix during index, and then remove that suffix & combine counts at search time.

      Maybe share more context about what you are doing / why you want to count the same value more than once?

      Delete
    2. The facet gathers firm names from contracts. Some firms have more than one "positions" in contract along with other firms. The firms are not known in advance which is perfect fit for facets to "discover" them but I want to know the exact counts. Some advice?

      Delete
    3. after thinking for a while there it will be nice to explicitly pass the facet a Int to be accepted for a value, and provide the count by any means of user implementation.

      Delete
    4. like IntAssociationFacetField. Note Its marked experimental.

      Delete
  8. The facet gathers firm names from contracts. Some firms have more than one "positions" in contract along with other firms. The firms are not known in advance which is perfect fit for facets to "discover" them but I want to know the exact counts. Some advice?

    ReplyDelete