Changing Bits: May 2013

Tuesday, May 21, 2013

Dynamic faceting with Lucene

Lucene's facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I'll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene's facet module does most of its work at indexing time: for each indexed document, it examines every facet label (each of which may be hierarchical), maps each unique label in the hierarchy to an integer id, and encodes all ids into a binary doc-values field. A separate taxonomy index stores this mapping and ensures that, even across segments, the same label gets the same id.

At search time, faceting cost is minimal: for each matched document, we visit all of its integer ids and aggregate counts in an array, summarizing the results at the end, for example as the top N facet labels by count.
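To make the split between index-time and search-time work concrete, here is a minimal plain-Java sketch. It does not use Lucene's classes: TaxonomySketch, idFor, addDocument and countMatches are hypothetical names, an in-memory map stands in for the taxonomy index, and a list of int[] stands in for the binary doc-values field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Lucene's API: a HashMap stands in for the taxonomy
// index and a list of int[] stands in for the binary doc-values field.
class TaxonomySketch {
    private final Map<String, Integer> labelToId = new HashMap<>();
    private final List<int[]> docIds = new ArrayList<>();

    // Returns the id for a label, assigning the next free id on first sight.
    int idFor(String label) {
        Integer id = labelToId.get(label);
        if (id == null) {
            id = labelToId.size();
            labelToId.put(label, id);
        }
        return id;
    }

    // Index time: map every prefix of each '/'-separated path to an id and
    // store the distinct ids for the new document. Returns the doc number.
    int addDocument(String... paths) {
        Set<Integer> ids = new LinkedHashSet<>();
        for (String path : paths) {
            StringBuilder prefix = new StringBuilder();
            for (String component : path.split("/")) {
                if (prefix.length() > 0) prefix.append('/');
                prefix.append(component);
                ids.add(idFor(prefix.toString()));
            }
        }
        int[] encoded = new int[ids.size()];
        int i = 0;
        for (int id : ids) encoded[i++] = id;
        docIds.add(encoded);
        return docIds.size() - 1;
    }

    // Search time: visit the stored ids of each matched doc and bump a counter.
    int[] countMatches(int... matchedDocs) {
        int[] counts = new int[labelToId.size()];
        for (int doc : matchedDocs) {
            for (int id : docIds.get(doc)) {
                counts[id]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        TaxonomySketch tax = new TaxonomySketch();
        tax.addDocument("Component/core/search");
        tax.addDocument("Component/core/index");
        tax.addDocument("Component/modules");
        int[] counts = tax.countMatches(0, 1, 2);
        System.out.println("Component/core = " + counts[tax.idFor("Component/core")]);
    }
}
```

The point of the sketch is that the search-time loop never touches a string: all label handling happened when the documents were added.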

This is in contrast to purely dynamic faceting implementations like ElasticSearch's and Solr's, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr's UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

Enough background; now on to our two new features!

Sorted-set doc-values faceting

These features bring two dynamic alternatives to the facet module, both computing facet counts from previously indexed doc-values fields. The first feature, sorted-set doc-values faceting (see LUCENE-4795), allows the application to index a normal sorted-set doc-values field, for example:

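  // Assumes the 4.3 facet module's SortedSetDocValuesFacetField and CategoryPath(dimension, label); "a" and "b" are placeholder dimensions
  doc.add(new SortedSetDocValuesFacetField(new CategoryPath("a", "foo")));
  doc.add(new SortedSetDocValuesFacetField(new CategoryPath("b", "bar")));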

and then to compute facet counts at search time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.

This feature does not use the taxonomy index, since all state is stored in the doc-values, but the tradeoff is that on each near-real-time reopen, a top-level data-structure is recomputed to map per-segment integer ordinals to global ordinals. The good news is this should be relatively low cost since it's just merge-sorting already sorted terms, and it doesn't need to visit the documents (unlike UnInvertedField).
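The merge described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene's implementation: GlobalOrdinalsSketch and buildOrdinalMap are hypothetical names, each segment is modeled as a sorted array of unique terms whose array index is the per-segment ordinal, and terms are compared as Java strings rather than as bytes.

```java
import java.util.PriorityQueue;

// Hypothetical sketch, not Lucene's API: merges already-sorted per-segment
// term lists to assign one global ordinal per unique term.
class GlobalOrdinalsSketch {
    // segmentTerms[s] holds segment s's unique terms in sorted order; the
    // index of a term in that array is its per-segment ordinal.
    // Returns map where map[s][segmentOrd] is the global ordinal.
    static int[][] buildOrdinalMap(String[][] segmentTerms) {
        int[][] map = new int[segmentTerms.length][];
        // Queue entries are {segment, position}, ordered by the term they point at.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
            (a, b) -> segmentTerms[a[0]][a[1]].compareTo(segmentTerms[b[0]][b[1]]));
        for (int s = 0; s < segmentTerms.length; s++) {
            map[s] = new int[segmentTerms[s].length];
            if (segmentTerms[s].length > 0) {
                queue.add(new int[] {s, 0});
            }
        }
        int globalOrd = -1;
        String previous = null;
        while (!queue.isEmpty()) {
            int[] top = queue.poll();
            int seg = top[0];
            int pos = top[1];
            String term = segmentTerms[seg][pos];
            // Equal terms from different segments share one global ordinal.
            if (!term.equals(previous)) {
                globalOrd++;
                previous = term;
            }
            map[seg][pos] = globalOrd;
            if (pos + 1 < segmentTerms[seg].length) {
                queue.add(new int[] {seg, pos + 1});
            }
        }
        return map;
    }

    public static void main(String[] args) {
        String[][] segments = {{"bar", "foo"}, {"baz", "foo", "qux"}};
        int[][] map = buildOrdinalMap(segments);
        // "foo" is ord 1 in both segments and ends up with the same global ord.
        System.out.println(map[0][1] == map[1][1]);
    }
}
```

Only the terms are visited, never the documents, which is why the rebuild stays cheap relative to the number of documents in the index.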

At search time there is also a small performance hit (~25%, depending on the query) since each per-segment ord must be re-mapped to the global ord space. Likely this could be improved (no time was spent optimizing). Furthermore, this feature currently only works with non-hierarchical facet fields, though this should be fixable (patches welcome!).

Dynamic range faceting

The second new feature, dynamic range faceting, works on top of a numeric doc-values field (see LUCENE-4965), and implements dynamic faceting over numeric ranges. You create a RangeFacetRequest, providing custom ranges with their labels. Each matched document is checked against all ranges, and the count for every range that contains its value is incremented. The range test is a simple linear search, which is probably OK since there are usually only a few ranges, but we could eventually upgrade this to an interval tree to get better performance (patches welcome!).
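The linear range test can be sketched in a few lines of plain Java. This is not the RangeFacetRequest API: RangeCountSketch and Range are hypothetical names, the bounds are assumed to be inclusive, and the long[] of values stands in for the numeric doc-values of the matched documents.

```java
// Hypothetical sketch, not Lucene's API: counts matched values per labeled
// range with a linear scan over the ranges.
class RangeCountSketch {
    // A labeled range over long values; both bounds are inclusive here.
    static final class Range {
        final String label;
        final long min;
        final long max;

        Range(String label, long min, long max) {
            this.label = label;
            this.min = min;
            this.max = max;
        }

        boolean accept(long value) {
            return value >= min && value <= max;
        }
    }

    // Ranges may overlap, so every value is tested against every range.
    static int[] count(Range[] ranges, long[] matchedValues) {
        int[] counts = new int[ranges.length];
        for (long value : matchedValues) {
            for (int i = 0; i < ranges.length; i++) {
                if (ranges[i].accept(value)) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long now = 100 * day; // fixed "now" so the example is deterministic
        Range[] ranges = {
            new Range("Past day", now - day, now),
            new Range("Past week", now - 7 * day, now),
            new Range("Past month", now - 30 * day, now),
        };
        long[] updated = {now - day / 2, now - 3 * day, now - 20 * day, now - 60 * day};
        int[] counts = count(ranges, updated);
        for (int i = 0; i < ranges.length; i++) {
            System.out.println(ranges[i].label + ": " + counts[i]);
        }
    }
}
```

Because time-based ranges like these nest inside one another, a document can count toward several ranges at once, which is why the inner loop does not stop at the first match.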

Likewise, this new feature does not use the taxonomy index, only a numeric doc-values field. This feature is especially useful with time-based fields. You can see it in action in the Jira issues search example in the Updated field.

Happy faceting!

Monday, May 13, 2013

Eating dog food with Lucene

Eating your own dog food is important in all walks of life: if you are a chef you should taste your own food; if you are a doctor you should treat yourself when you are sick; if you build houses for a living you should live in a house you built; if you are a parent then try living by the rules that you set for your kids (most parents would fail miserably at this!); and if you build software you should constantly use your own software.

So, for the past few weeks I've been doing exactly that: building a simple Lucene search application, searching all Lucene and Solr Jira issues, and using it instead of Jira's search whenever I need to go find an issue.

It's currently running at jirasearch.mikemccandless.com and it's still quite rough (feedback welcome!).

It's a good showcase of a number of Lucene features:

Highlighting using the new PostingsHighlighter; for example, try searching for fuzzy query.
Autosuggest, using the not-yet-committed AnalyzingInfixSuggester (LUCENE-4845).
Sorting by various fields, including a blended recency and relevance sort.
A few synonym examples, for example try searching for oome.
Near-real-time searching, and controlled searcher versions: the server uses NRTManager, SearcherLifetimeManager and SearcherManager.
ToParentBlockJoinQuery: each issue is indexed as a parent document, and then each comment on the issue is indexed as a separate child document. This allows the server to know which specific comment, along with its metadata, was a match for the query, and if you click on that comment (in the highlighted results) it will take you to that comment in Jira. This is very helpful for mega-issues!
Okapi BM25 for ranking.
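For the block join item above, here is a rough plain-Java sketch of how a matching comment can be tied back to its issue. It is not the ToParentBlockJoinQuery API: BlockJoinSketch and joinToParents are hypothetical names, and it assumes the block layout in which an issue's comments are indexed as consecutive documents immediately before the issue itself.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene's API: groups matching child doc ids under
// the parent doc that closes their block.
class BlockJoinSketch {
    // parents has a bit set at every parent doc id. matchedChildren must hold
    // child doc ids only. Returns parent doc id -> matching child doc ids.
    static Map<Integer, List<Integer>> joinToParents(BitSet parents, int[] matchedChildren) {
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        for (int child : matchedChildren) {
            // The parent is the first parent doc at or after the child.
            int parent = parents.nextSetBit(child);
            if (parent < 0) {
                continue; // no parent follows: not part of a well-formed block
            }
            result.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
        }
        return result;
    }

    public static void main(String[] args) {
        // Docs 0-2 are comments on the issue at doc 3; doc 4 is a comment on doc 5.
        BitSet parents = new BitSet();
        parents.set(3);
        parents.set(5);
        System.out.println(joinToParents(parents, new int[] {1, 4}));
    }
}
```

Keeping the child ids alongside the parent is what lets a result link to the specific comment that matched rather than only to the issue.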

The drill-downs on the left also show a number of features from Lucene's facet module:

Drill sideways for all fields, so that the field does not disappear when you drill down on it.
Dynamic range faceting: the Updated drill-down is computed dynamically, e.g. all issues updated in the past week.
Hierarchical fields, which are simple since the Lucene facet module supports hierarchy natively. Only the Component dimension is hierarchical, e.g. look at the Component drill down for all Lucene core issues.
Multi-select faceting (hold down the shift key when clicking on a value), e.g. all improvements and new features.
Multi-valued fields (e.g. User, Fix version, Label).
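The drill sideways behavior in the list above can be illustrated with a small plain-Java sketch. It is not the DrillSideways API: DrillSidewaysSketch, doc and count are hypothetical names, documents are modeled as single-valued dimension-to-value maps, and the approach shown (counting "near misses" that fail exactly one drill-down toward that one dimension) is my simplified reading of the idea.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene's API: computes facet counts so that a
// drilled-down dimension still shows its other values.
class DrillSidewaysSketch {
    // Convenience builder: doc("Status", "Open", "Type", "Bug").
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> fields = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            fields.put(keyValues[i], keyValues[i + 1]);
        }
        return fields;
    }

    // drillDowns maps dimension -> required value.
    // Returns dimension -> (value -> count).
    static Map<String, Map<String, Integer>> count(
            List<Map<String, String>> docs, Map<String, String> drillDowns) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (Map<String, String> d : docs) {
            int failures = 0;
            String failedDim = null;
            for (Map.Entry<String, String> dd : drillDowns.entrySet()) {
                if (!dd.getValue().equals(d.get(dd.getKey()))) {
                    failures++;
                    failedDim = dd.getKey();
                }
            }
            if (failures > 1) {
                continue; // fails two or more drill-downs: contributes nothing
            }
            for (Map.Entry<String, String> field : d.entrySet()) {
                // A hit counts toward every dimension; a near miss counts only
                // toward the one dimension whose drill-down it failed.
                if (failures == 0 || field.getKey().equals(failedDim)) {
                    counts.computeIfAbsent(field.getKey(), k -> new HashMap<>())
                          .merge(field.getValue(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = java.util.Arrays.asList(
            doc("Status", "Open", "Type", "Bug"),
            doc("Status", "Closed", "Type", "Bug"),
            doc("Status", "Open", "Type", "Feature"));
        // Status still lists Closed even though we drilled down on Open.
        System.out.println(count(docs, doc("Status", "Open")));
    }
}
```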

This is really eating two different dog foods: first, as a developer I see what one must go through to build a search server on top of Lucene's APIs, but second, as an end user, I experience the resulting search user interface whenever I need to find a Lucene or Solr issue. It's like having to eat both wet and dry dog food at once, and both kinds of dog food have uncovered numerous issues!

The issues ranged from outright bugs, such as PostingsHighlighter picking the worst snippets instead of the best (LUCENE-4826), to missing features, like dynamic numeric range facets (LUCENE-4965), to APIs that are awkward to consume, especially when mixing different features, such as the difficulty of mixing non-range and range facets with DrillSideways (LUCENE-4980) and of using NRTManager with both a taxonomy index and a search index (LUCENE-4967), and finally to plain inefficiencies, such as the inability to customize how PostingsHighlighter loads its field values (LUCENE-4846).

The process is far from done! There are still a number of issues that need fixing. For example, it's not easy to mix ToParentBlockJoinQuery and grouping, which is frustrating because fields like who reported an issue, severity, issue status, and component would all be natural group-by fields. Some issues, such as the inability of PassageFormatter to render directly to a JSONObject (LUCENE-4896), are still open because they are challenging to fix cleanly. Others, such as the infix suggester (LUCENE-4845), are in limbo because of disagreements on the best approach, while still others, like the BlendedComparator used to sort by mixed relevance and recency, I just haven't pushed back into Lucene yet.

There are plenty of ways to provoke an error from the server; here are two fun examples: try fielded search such as summary:python, or a multi-select drilldown on the Updated field.

Much work remains and I'll keep on eating both wet and dry dog food: it's a very productive way to find problems in Lucene!