Lucene's facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice "getting started" examples in his second post.
The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I'm sure there are more...
The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.
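To make the complements idea concrete, here is a minimal sketch in plain Python. It is not the facet module's API: the function names, the one-facet-value-per-document layout, and the precomputed `total_counts` are all assumptions made for illustration. When the hits cover more than half of the index, the sketch counts facets over the non-matching documents and subtracts them from the whole-index totals.

```python
from collections import Counter

def count_facets(doc_ids, facet_of):
    """Count facet values by visiting every document in doc_ids."""
    return Counter(facet_of[d] for d in doc_ids)

def count_with_complements(hits, facet_of, total_counts):
    """Return facet counts for hits, visiting whichever doc set is smaller."""
    all_docs = set(facet_of)
    if len(hits) * 2 <= len(all_docs):
        # Few hits: aggregate them directly.
        return count_facets(hits, facet_of)
    # Many hits: aggregate the misses and subtract from the index-wide totals.
    misses = all_docs - set(hits)
    counts = total_counts.copy()
    counts.subtract(count_facets(misses, facet_of))
    # Drop values whose count fell to zero.
    return Counter({value: n for value, n in counts.items() if n > 0})

# Hypothetical five-document index with one year facet per document.
facet_of = {0: "2010", 1: "2011", 2: "2012", 3: "2012", 4: "2012"}
total_counts = count_facets(facet_of, facet_of)

hits = [1, 2, 3, 4]  # four of five documents match
complement_counts = count_with_complements(hits, facet_of, total_counts)
direct_counts = count_facets(hits, facet_of)
```

Both routes give the same counts; the complement route only had to visit the single non-matching document.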
Lucene's nightly performance benchmarks
I was curious about the performance of faceted search, so I added date facets, indexed as a year/month/day hierarchy, to the nightly Lucene benchmarks. Specifically I added faceting to all TermQuerys that were already tested, and now we can watch this graph to track our faceted search performance over time. The date field is the timestamp of the most recent revision of each Wikipedia page.
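As a rough illustration of what a year/month/day hierarchy means for one document, here is a small sketch in plain Python. The `Date` root label comes from the counts shown later in the post; the helper names, the use of UTC, and the unpadded month and day strings are assumptions, not the benchmark's actual code.

```python
from datetime import datetime, timezone

def date_facet_path(timestamp):
    """Turn a Unix timestamp into a (root, year, month, day) facet path."""
    dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return ("Date", str(dt.year), str(dt.month), str(dt.day))

def ancestors(path):
    """All prefixes of a path, root first; a hit counts toward each of them."""
    return [path[:i] for i in range(1, len(path) + 1)]

# Hypothetical revision timestamp: 2 May 2012, midnight UTC.
ts = datetime(2012, 5, 2, tzinfo=timezone.utc).timestamp()
path = date_facet_path(ts)
levels = ancestors(path)
```

A document with this path contributes one count to `Date`, one to `Date/2012`, one to `Date/2012/5` and one to `Date/2012/5/2`, which is why the app can drill down from years to months to days.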
Simple performance tests
I also ran some simple initial tests on a recent (5/2/2012) English Wikipedia export, which contains 30.2 GB of plain text across 33.3 million documents. By default, faceted search retrieves the counts of all facet values under the root node (years, in this case):
Date (3994646)
  2012 (1990192)
  2011 (752327)
  2010 (380977)
  2009 (275152)
  2008 (271543)
  2007 (211688)
  2006 (98809)
  2005 (12846)
  2004 (1105)
  2003 (7)
It's interesting that 2012 has such a high count, even though this export only includes the first four months and two days of 2012. Wikipedia's pages are very actively edited!
The search index with facets grew only slightly (~2.3%, from 12.5 GB to 12.8 GB) because of the additional indexed facet field. The taxonomy index, which is a separate index used to map facets to fixed integer codes, was tiny: only 120 KB. The more unique facet values you have, the larger this index will be.
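The idea of mapping facets to fixed integer codes can be sketched as follows. This is a toy in-memory structure in plain Python, not the taxonomy index's real format or API; the class name, the `add` method and the `parents` array are hypothetical.

```python
class Taxonomy:
    """Toy taxonomy: assigns each facet path a fixed integer ordinal."""

    def __init__(self):
        self.ordinals = {(): 0}  # the empty path is the root, ordinal 0
        self.parents = [-1]      # parents[ordinal] is the parent's ordinal

    def add(self, path):
        """Return the ordinal for path, adding it and any missing ancestors."""
        path = tuple(path)
        if path in self.ordinals:
            return self.ordinals[path]
        parent = self.add(path[:-1])  # make sure the parent exists first
        ordinal = len(self.parents)
        self.ordinals[path] = ordinal
        self.parents.append(parent)
        return ordinal

taxo = Taxonomy()
first = taxo.add(("Date", "2012", "5", "2"))
second = taxo.add(("Date", "2012", "5", "3"))
again = taxo.add(("Date", "2012", "5", "2"))  # already known: same ordinal
```

Only previously unseen paths create new entries, so the structure grows with the number of unique facet values rather than with the number of documents.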
Next I compared search performance with and without faceting. A simple TermQuery (party), matching just over a million hits, ran at 51.2 queries per second (QPS) without facets and 3.4 QPS with facets. While this is a somewhat scary slowdown, it's the worst-case scenario: TermQuery is very cheap to execute, and can easily match a large number of hits. The cost of faceting is in proportion to the number of hits. It would be nice to speed this up (patches welcome!).
I also tested a harder PhraseQuery ("the village"), matching 194 K hits: 3.8 QPS without facets and 2.8 QPS with facets, which is less of a hit because PhraseQuery takes more work to match each hit and generally matches fewer hits.
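The QPS figures above can be turned into a rough per-query and per-hit cost of faceting. The sketch below is back-of-the-envelope arithmetic in plain Python under two assumptions that are not stated in the post: queries run one at a time, so latency is simply 1/QPS, and "just over a million hits" is treated as exactly 1,000,000.

```python
def latency_ms(qps):
    """Per-query latency in milliseconds, assuming one query at a time."""
    return 1000.0 / qps

def facet_cost(qps_without, qps_with, hits):
    """Return (added ms per query, added microseconds per hit) due to faceting."""
    added_ms = latency_ms(qps_with) - latency_ms(qps_without)
    return added_ms, added_ms * 1000.0 / hits

# QPS figures from the tests above; hit counts are rounded (assumption).
term_added_ms, term_us_per_hit = facet_cost(51.2, 3.4, 1_000_000)
phrase_added_ms, phrase_us_per_hit = facet_cost(3.8, 2.8, 194_000)
```

Under those assumptions faceting adds roughly 275 ms to the TermQuery and roughly 94 ms to the PhraseQuery, which is a fraction of a microsecond per hit in both cases.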
Loading facet data in RAM
For the above results I used the facet defaults, where the per-document facet values are left on disk during aggregation. If you have enough RAM you can also load all facet values into RAM using the CategoryListCache class. I tested this, and it gave nice speedups: the TermQuery was 73% faster (to 6.0 QPS) and the PhraseQuery was 19% faster.
However, there are downsides: it's time-consuming to initialize (4.6 seconds in my test), and not NRT-friendly, though this shouldn't be so hard to fix (patches welcome!). It also required a substantial 1.9 GB RAM, according to Lucene's RamUsageEstimator.
We should be able to reduce this RAM usage by switching to Lucene's fast packed ints implementation from the current int[][] it uses today, or by using DocValues to hold the per-document facet data. I just opened LUCENE-4602 to explore DocValues and initial results look very promising.
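To show why packed ints can save RAM compared with one full-width int per value, here is a generic bit-packing sketch in plain Python. It illustrates the general technique only and says nothing about how Lucene's packed ints are laid out internally; the function names and sample ordinals are hypothetical.

```python
def bits_required(max_value):
    """Smallest number of bits that can represent max_value."""
    return max(1, max_value.bit_length())

def pack(values, bits):
    """Pack non-negative ints into one big integer, `bits` bits per value."""
    packed = 0
    for i, value in enumerate(values):
        packed |= value << (i * bits)
    return packed

def unpack_at(packed, bits, index):
    """Read back the value stored at position `index`."""
    return (packed >> (index * bits)) & ((1 << bits) - 1)

# Hypothetical per-document facet ordinals.
ordinals = [4, 5, 4, 2, 1]
bits = bits_required(max(ordinals))  # 3 bits per value instead of 32
packed = pack(ordinals, bits)
restored = [unpack_at(packed, bits, i) for i in range(len(ordinals))]
```

The fewer distinct ordinals there are, the fewer bits each value needs, so a small taxonomy like the date hierarchy should pack well.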
Sampling
Next I tried sampling, where the facet module visits 1% of the hits (by default) and only aggregates counts for those. In the default mode, this sampling is used only to find the top N facet values, and then a second pass computes the correct count for each of those values. This is a good fit when the taxonomy is wide and flat, and counts are pretty evenly distributed. I tested that, but results were slower, because the date taxonomy is not wide and flat and has rather lopsided counts (2012 has the majority of hits).
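The two-pass idea can be sketched like this in plain Python. The function name, the random sampling strategy, and the way the second pass simply revisits every hit are assumptions for illustration; the post does not describe how the facet module implements either pass.

```python
import random
from collections import Counter

def top_n_with_sampling(hits, facet_of, n, ratio=0.01, seed=0):
    """Pick top-n candidates from a sample, then count only those exactly."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(hits) * ratio))
    sample = rng.sample(hits, sample_size)

    # Pass 1: approximate counts on the sample choose the candidates.
    sampled = Counter(facet_of[d] for d in sample)
    candidates = {value for value, _ in sampled.most_common(n)}

    # Pass 2: exact counts, but only for the candidate values.
    exact = Counter(facet_of[d] for d in hits if facet_of[d] in candidates)
    return exact.most_common(n)

# Hypothetical lopsided data: 90% of hits fall in 2012.
hits = list(range(10_000))
facet_of = {d: "2011" if d % 10 == 0 else "2012" for d in hits}
top = top_n_with_sampling(hits, facet_of, n=1)
```

With lopsided counts like these, the sample finds the winner easily, but the second pass still has real work to do for the dominant value, which is consistent with sampling not helping on the date taxonomy.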
You can also skip the second pass and then present approximate counts or a percentage value to the user. I tested that and saw sizable gains: the TermQuery was 248% (2.5X) faster (to 12.2 QPS) and the PhraseQuery was 29% faster (to 3.6 QPS). The sampling is also quite configurable: you can set the min and max sample sizes, the sample ratio, the threshold under which no sampling should happen, etc.
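Skipping the second pass amounts to scaling the sampled counts back up. Here is a minimal sketch in plain Python; the parameter names and their defaults (other than the 1% ratio mentioned above) are hypothetical stand-ins for the kinds of settings listed, not the facet module's actual configuration.

```python
import random
from collections import Counter

def approximate_counts(hits, facet_of, ratio=0.01, min_sample=100,
                       threshold=1000, seed=0):
    """Return (counts, is_exact); estimates come from a scaled-up sample."""
    if len(hits) < threshold:
        # Too few hits for sampling to pay off: count everything.
        return Counter(facet_of[d] for d in hits), True
    size = min(len(hits), max(min_sample, int(len(hits) * ratio)))
    sample = random.Random(seed).sample(hits, size)
    scale = len(hits) / size
    sampled = Counter(facet_of[d] for d in sample)
    estimated = Counter({v: round(c * scale) for v, c in sampled.items()})
    return estimated, False

# Hypothetical data: 10,000 hits, 90% of them in 2012.
hits = list(range(10_000))
facet_of = {d: "2011" if d % 10 == 0 else "2012" for d in hits}

estimates, exact = approximate_counts(hits, facet_of)
small_counts, small_exact = approximate_counts(hits[:500], facet_of)
```

The estimates carry sampling error, so an application would typically show them as approximate counts or percentages rather than as exact numbers.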
Lucene's facet module makes it trivial to add facets to your search application, and offers useful features like sampling, alternative aggregates, complements, RAM caching, and fully customizable interfaces for many aspects of faceting. I'm hopeful we can reduce the RAM consumption for caching, and speed up the overall performance, over time.
hey Michael, I am new in lucene, want to create a nested segmented indexes ? how should i achieve this? pls give me some reference to study it...
Hi Anonymous,
It's probably best to send an email to java-user@lucene.apache.org (subscribe first so you'll see the response) asking about nested segment indices (I'm not sure what that is...).
Hi Michael,
Is it possible to create a incremental backup using lucene for disaster recovery? (for e.g. i have 100 indexed data at some ABC location and backup data at some other location. Now I want to add 10 new index to my ABC Location and at a same time want to take a backup at my backup location to create disaster recovery). Pls explain me how should I achieve this.
Hi Anonymous,
Yes, this is straightforward: Lucene has support for hot backups out-of-the-box. E.g. see http://blog.mikemccandless.com/2012/03/transactional-lucene.html
Also the just-committed replication module enables taking hot backups remotely, e.g. via http.
hi Michael,
Can we take a data from two different index location ?
Hi Anonymous,
Do you mean merging facet counts from two remote indices?
yes, is it possible ?
Hi Anonymous,
Yes it is possible ...
Have a look at the thread "FacetedSearch and MultiReader" on the java-user list on Jun 21, 2013: http://markmail.org/thread/hdddwwj546pr5nm2 ... that describes a process to merge taxonomy indices and use a common taxonomy index for each remote index.
We really need to have better support for it, specifically for the [common] case where you cannot merge the taxo indices up front. We have an issue open for this: https://issues.apache.org/jira/browse/LUCENE-4710 ... but no progress on it so far (patches welcome!).
is it reasonable to expect 1-2 seconds per query on -
- x3885 intel, 12 gb
- search both Facets and hits
- 9 gb index
- 40-80 millions hits
- 7-8 facets
----------
Can we limit the RAM consumption and avoid swap?
It depends heavily on the details of your setup, e.g. do you sort by field or score, how many facet labels per document, what kinds of queries you're running, etc.
Mike!
Is facet the right way to do min/max/avg/count of a field? or is there a better way to do that
we want to achieve something like q=avg(version)
Hi Srividhya..
Have you found a solution for your problem? I am also interested in knowing how to use Lucene for aggregation like AVG, SUM..
Awesome writing :)