Sunday, December 9, 2012

Fun with Lucene's faceted search module

These days faceted search and navigation are common, and users have come to expect and rely on them.

Lucene's facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice "getting started" examples in his second post.

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I'm sure there are more...

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.
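To make the complements idea a bit more concrete, here is a toy Python sketch. It is not the facet module's API: the function names, the list-of-labels document model and the precomputed `total_counts` are all assumptions made for illustration. The point is only that when the hits cover more than half of the index, it is cheaper to visit the non-matching documents and subtract their counts from totals computed once up front.

```python
from collections import Counter

def count_direct(doc_facets, docs):
    """Aggregate facet counts by visiting each given document."""
    counts = Counter()
    for doc in docs:
        counts.update(doc_facets[doc])
    return counts

def count_with_complements(doc_facets, hits, total_counts):
    """Count facets for `hits`, visiting whichever side is smaller.

    doc_facets   -- list indexed by doc id; each entry is a list of labels
    total_counts -- counts over the whole index, computed once (assumed)
    """
    hits = set(hits)
    if len(hits) * 2 <= len(doc_facets):
        # Few hits: just count them directly.
        return count_direct(doc_facets, hits)
    # Many hits: count the misses and subtract from the totals.
    misses = (doc for doc in range(len(doc_facets)) if doc not in hits)
    counts = Counter(total_counts)
    counts.subtract(count_direct(doc_facets, misses))
    return +counts  # unary plus drops labels whose count fell to zero

doc_facets = [["2012"], ["2012"], ["2011"], ["2010"], ["2012"]]
total_counts = count_direct(doc_facets, range(len(doc_facets)))
```

With four of the five documents matching, `count_with_complements(doc_facets, [0, 1, 2, 4], total_counts)` only has to visit the single non-matching document.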

Lucene's nightly performance benchmarks

I was curious about the performance of faceted search, so I added date facets, indexed as year/month/day hierarchy, to the nightly Lucene benchmarks. Specifically I added faceting to all TermQuerys that were already tested, and now we can watch this graph to track our faceted search performance over time. The date field is the timestamp of the most recent revision of each Wikipedia page.
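For readers who want to picture the year/month/day hierarchy, here is a small Python sketch of the idea. It is not the benchmark's actual code; the function names and the tuple-based path are assumptions. Each document contributes one path, and a hit is counted at every level of that path, which is what lets the user drill down from years to months to days.

```python
from collections import Counter
from datetime import date

def date_facet_path(d):
    """Return the hierarchical facet path for one document's date."""
    return ("Date", str(d.year), str(d.month), str(d.day))

def count_all_levels(dates):
    """Count each path and each of its ancestors once per document."""
    counts = Counter()
    for d in dates:
        path = date_facet_path(d)
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1
    return counts

counts = count_all_levels([date(2012, 5, 2), date(2012, 4, 30), date(2011, 12, 9)])
```

Here `counts[("Date", "2012")]` is the number of documents whose date falls anywhere in 2012, while `counts[("Date", "2012", "5", "2")]` only covers that single day.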

Simple performance tests

I also ran some simple initial tests on a recent (5/2/2012) English Wikipedia export, which contains 30.2 GB of plain text across 33.3 million documents. By default, faceted search retrieves the counts of all facet values under the root node (years, in this case):
     Date (3994646)
       2012 (1990192)
       2011 (752327)
       2010 (380977)
       2009 (275152)
       2008 (271543)
       2007 (211688)
       2006 (98809)
       2005 (12846)
       2004 (1105)
       2003 (7)
It's interesting that 2012 has such a high count, even though this export only includes the first four months and two days of 2012. Wikipedia's pages are very actively edited!

The search index with facets grew only slightly (~2.3%, from 12.5 GB to 12.8 GB) because of the additional indexed facet field. The taxonomy index, which is a separate index used to map facets to fixed integer codes, was tiny: only 120 KB. The more unique facet values you have, the larger this index will be.
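The mapping from facets to integer codes can be pictured with a toy in-memory stand-in like the one below. This is only a sketch: the real taxonomy is a separate Lucene index, and the `Taxonomy` class, its `add` method and the parent array here are hypothetical names chosen for illustration. Each distinct path (and each ancestor of it) gets the next free integer, which is why the size grows with the number of unique facet values.

```python
class Taxonomy:
    """Toy stand-in for a taxonomy index: path -> ordinal, plus parents."""

    def __init__(self):
        self.ordinals = {(): 0}  # the empty path is the root, ordinal 0
        self.parents = [-1]      # parents[ordinal] = ordinal of the parent

    def add(self, path):
        """Return the ordinal for `path`, assigning new ones as needed."""
        path = tuple(path)
        if path in self.ordinals:
            return self.ordinals[path]
        parent = self.add(path[:-1])   # make sure every ancestor exists
        ordinal = len(self.parents)    # next unused integer code
        self.ordinals[path] = ordinal
        self.parents.append(parent)
        return ordinal

taxo = Taxonomy()
may = taxo.add(("Date", "2012", "5"))
april = taxo.add(("Date", "2012", "4"))
```

Adding the same path again returns the ordinal it already has, so documents can store compact integers instead of repeating the full path.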

Next I compared search performance with and without faceting. A simple TermQuery (party), matching just over a million hits, ran at 51.2 queries per second (QPS) without facets and 3.4 QPS with facets, roughly a 15X slowdown. While this is somewhat scary, it's the worst-case scenario: TermQuery is very cheap to execute, and can easily match a large number of hits. The cost of faceting is in proportion to the number of hits. It would be nice to speed this up (patches welcome!).

I also tested a harder PhraseQuery ("the village"), matching 194 K hits: 3.8 QPS without facets and 2.8 QPS with facets, which is less of a hit because PhraseQuery takes more work to match each hit and generally matches fewer hits.
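As a back-of-the-envelope check on the "proportional to the number of hits" point, the QPS numbers above can be turned into an approximate faceting cost per hit. This sketch assumes queries run one at a time, so that time per query is simply 1/QPS, and it rounds "just over a million" down to 1,000,000 hits; both are my assumptions, not something measured separately.

```python
def facet_cost_per_hit_us(qps_plain, qps_faceted, hits):
    """Extra microseconds per hit attributable to faceting.

    Assumes time per query is 1/QPS (queries run one at a time).
    """
    extra_seconds = 1.0 / qps_faceted - 1.0 / qps_plain
    return extra_seconds / hits * 1e6

# TermQuery (party): ~1M hits, 51.2 QPS plain vs 3.4 QPS with facets
term_cost = facet_cost_per_hit_us(51.2, 3.4, 1_000_000)

# PhraseQuery ("the village"): 194 K hits, 3.8 QPS plain vs 2.8 QPS with facets
phrase_cost = facet_cost_per_hit_us(3.8, 2.8, 194_000)
```

Under those assumptions both queries land well under a microsecond of faceting work per hit, within about a factor of two of each other, even though the absolute slowdowns look very different.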

Loading facet data in RAM

For the above results I used the facet defaults, where the per-document facet values are left on disk during aggregation. If you have enough RAM you can also load all facet values into RAM using the CategoryListCache class. I tested this, and it gave nice speedups: the TermQuery was 73% faster (to 6.0 QPS) and the PhraseQuery was 19% faster.

However, there are downsides: it's time-consuming to initialize (4.6 seconds in my test), and not NRT-friendly, though this shouldn't be so hard to fix (patches welcome!). It also required a substantial 1.9 GB RAM, according to Lucene's RamUsageEstimator. We should be able to reduce this RAM usage by switching to Lucene's fast packed ints implementation from the current int[][] it uses today, or by using DocValues to hold the per-document facet data. I just opened LUCENE-4602 to explore DocValues and initial results look very promising.

Sampling

Next I tried sampling, where the facet module visits 1% of the hits (by default) and only aggregates counts for those. In the default mode, this sampling is used only to find the top N facet values, and then a second pass computes the correct count for each of those values. This is a good fit when the taxonomy is wide and flat, and counts are pretty evenly distributed. I tested that, but results were slower, because the date taxonomy is not wide and flat and has rather lopsided counts (2012 has the majority of hits).
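The two-pass idea can be sketched in a few lines of Python. This is not the facet module's sampler: the function name, the one-label-per-hit model and the use of uniform random sampling are assumptions for illustration. The first pass only looks at a sample to pick candidate values; the second pass walks all hits but counts just those candidates.

```python
import random
from collections import Counter

def top_n_with_sampling(hit_facets, n, ratio=0.01, seed=0):
    """Return exact counts for the top-n labels picked from a sample.

    hit_facets -- one facet label per hit (a simplification)
    ratio      -- fraction of hits to sample (1% as in the text above)
    """
    if not hit_facets:
        return []
    rng = random.Random(seed)
    sample_size = max(1, int(len(hit_facets) * ratio))
    sample = rng.sample(hit_facets, sample_size)
    # Pass 1: approximate counts on the sample choose the candidates.
    candidates = {label for label, _ in Counter(sample).most_common(n)}
    # Pass 2: exact counts over all hits, but only for the candidates.
    exact = Counter(label for label in hit_facets if label in candidates)
    return exact.most_common(n)

hits = ["2012"] * 6000 + ["2011"] * 3000 + ["2010"] * 1000
top_two = top_n_with_sampling(hits, 2, ratio=0.1)
```

The second pass still touches every hit, which is consistent with sampling not paying off when there are only a handful of distinct values to begin with, as with the date taxonomy here.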

You can also skip the second pass and then present approximate counts or a percentage value to the user. I tested that and saw sizable gains: the TermQuery was 248% (2.5X) faster (to 12.2 QPS) and the PhraseQuery was 29% faster (to 3.6 QPS). The sampling is also quite configurable: you can set the min and max sample sizes, the sample ratio, the threshold under which no sampling should happen, etc.
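Skipping the second pass amounts to scaling the sampled counts back up, or reporting them as percentages. The sketch below shows one way that could look in Python. The parameter names (`min_sample`, `max_sample`, `threshold`) and their default values are hypothetical, chosen only to mirror the knobs mentioned above; they are not the facet module's actual settings.

```python
import random
from collections import Counter

def approximate_counts(hit_facets, ratio=0.01, min_sample=100,
                       max_sample=10_000, threshold=1_000, seed=0):
    """Estimate facet counts from a sample of the hits (no second pass)."""
    total = len(hit_facets)
    if total <= threshold:
        return Counter(hit_facets)  # too few hits to bother sampling
    # Clamp the sample size between the min and max (and the hit count).
    size = min(max(int(total * ratio), min_sample), max_sample, total)
    sample = random.Random(seed).sample(hit_facets, size)
    scale = total / size
    # Scale each sampled count up to an estimate for the full hit set.
    return Counter({label: round(count * scale)
                    for label, count in Counter(sample).items()})

def as_percentages(counts):
    """Convert counts to percentages of all counted hits."""
    total = sum(counts.values())
    return {label: 100.0 * count / total for label, count in counts.items()}

hits = ["2012"] * 60_000 + ["2011"] * 40_000
approx = approximate_counts(hits)
```

Presenting `as_percentages(approx)` rather than raw numbers is one way to avoid showing the user counts that look exact but are not.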

Lucene's facet module makes it trivial to add facets to your search application, and offers useful features like sampling, alternative aggregates, complements, RAM caching, and fully customizable interfaces for many aspects of faceting. I'm hopeful we can reduce the RAM consumption for caching, and speed up the overall performance, over time.

12 comments:

  1. Hey Michael, I am new to Lucene and want to create nested segmented indexes. How should I achieve this? Please give me some references to study.

  2. Hi Anonymous,

    It's probably best to send an email to java-user@lucene.apache.org (subscribe first so you'll see the response) asking about nested segment indices (I'm not sure what that is...).

  3. Hi Michael,

    Is it possible to create an incremental backup using Lucene for disaster recovery? (E.g. I have 100 indexed data items at some ABC location and backup data at some other location. Now I want to add 10 new items to the index at my ABC location and at the same time take a backup at my backup location for disaster recovery.) Please explain how I should achieve this.

  4. Hi Anonymous,

    Yes, this is straightforward: Lucene has support for hot backups out-of-the-box. E.g. see http://blog.mikemccandless.com/2012/03/transactional-lucene.html

    Also the just-committed replication module enables taking hot backups remotely, e.g. via http.

  5. hi Michael,

    Can we take data from two different index locations?

  6. Hi Anonymous,

    Do you mean merging facet counts from two remote indices?

    1. Yes, is it possible?

    2. Hi Anonymous,

      Yes it is possible ...
      Have a look at the thread "FacetedSearch and MultiReader" on the java-user list on Jun 21, 2013: http://markmail.org/thread/hdddwwj546pr5nm2 ... that describes a process to merge taxonomy indices and use a common taxonomy index for each remote index.

      We really need to have better support for it, specifically for the [common] case where you cannot merge the taxo indices up front. We have an issue open for this: https://issues.apache.org/jira/browse/LUCENE-4710 ... but no progress on it so far (patches welcome!).

  7. Is it reasonable to expect 1-2 seconds per query with:
    - x3885 Intel, 12 GB
    - searching both facets and hits
    - 9 GB index
    - 40-80 million hits
    - 7-8 facets
    ----------
    Can we limit the RAM consumption and avoid swapping?

  8. It depends heavily on the details of your setup, e.g. do you sort by field or score, how many facet labels per document, what kinds of queries you're running, etc.

  9. Mike!

    Is faceting the right way to compute min/max/avg/count of a field, or is there a better way to do that?

    We want to achieve something like q=avg(version).

    1. Hi Srividhya,

      Yes, you could do simple aggregations using Lucene's facet module, but it would likely require some custom code. Maybe ask on the users list (java-user@lucene.apache.org)? Alternatively, Elasticsearch has a richer aggregations module that lets you directly compute expressions like this. Solr may have something too; I'm less familiar with it though.
