Tuesday, May 21, 2013

Dynamic faceting with Lucene

Lucene's facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I'll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene's facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.

At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count.

This is in contrast to purely dynamic faceting implementations like ElasticSearch's and Solr's, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr's UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

Enough background, now on to our two new features!


Sorted-set doc-values faceting

These features bring two dynamic alternatives to the facet module, both computing facet counts from previously indexed doc-values fields. The first feature, sorted-set doc-values faceting (see LUCENE-4795), allows the application to index a normal sorted-set doc-values field, for example:
  doc.add(new SortedSetDocValuesField("foo"));
  doc.add(new SortedSetDocValuesField("bar"));
and then to compute facet counts at search time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.

This feature does not use the taxonomy index, since all state is stored in the doc-values, but the tradeoff is that on each near-real-time reopen, a top-level data-structure is recomputed to map per-segment integer ordinals to global ordinals. The good news is this should be relatively low cost since it's just merge-sorting already sorted terms, and it doesn't need to visit the documents (unlike UnInvertedField).

At search time there is also a small performance hit (~25%, depending on the query) since each per-segment ord must be re-mapped to the global ord space. Likely this could be improved (no time was spend optimizing). Furthermore, this feature currently only works with non-hierarchical facet fields, though this should be fixable (patches welcome!).


Dynamic range faceting

The second new feature, dynamic range faceting, works on top of a numeric doc-values field (see LUCENE-4965), and implements dynamic faceting over numeric ranges. You create a RangeFacetRequest, providing custom ranges with their labels. Each matched document is checked against all ranges and the count is incremented when there is a match. The range-test is a naive simple linear search, which is probably OK since there are usually only a few ranges, but we could eventually upgrade this to an interval tree to get better performance (patches welcome!).

Likewise, this new feature does not use the taxonomy index, only a numeric doc-values field. This feature is especially useful with time-based fields. You can see it in action in the Jira issues search example in the Updated field.

Happy faceting!

Monday, May 13, 2013

Eating dog food with Lucene


Eating your own dog food is important in all walks of life: if you are a chef you should taste your own food; if you are a doctor you should treat yourself when you are sick; if you build houses for a living you should live in a house you built; if you are a parent then try living by the rules that you set for your kids (most parents would fail miserably at this!); and if you build software you should constantly use your own software.

So, for the past few weeks I've been doing exactly that: building a simple Lucene search application, searching all Lucene and Solr Jira issues, and using it instead of Jira's search whenever I need to go find an issue.

It's currently running at jirasearch.mikemccandless.com and it's still quite rough (feedback welcome!).

It's a good showcase of a number of Lucene features:

  • Highlighting using the new PostingsHighlighter; for example, try searching for fuzzy query.

  • Autosuggest, using the not-yet-committed AnalyzingInfixSuggester (LUCENE-4845).

  • Sorting by various fields, including a blended recency and relevance sort.

  • A few synonym examples, for example try searching for oome.

  • Near-real-time searching, and controlled searcher versions: the server uses NRTManager, SearcherLifetimeManager and SearcherManager.

  • ToParentBlockJoinQuery: each issue is indexed as a parent document, and then each comment on the issue is indexed as a separate child document. This allows the server to know which specific comment, along with its metadata, was a match for the query, and if you click on that comment (in the highlighted results) it will take you to that comment in Jira. This is very helpful for mega-issues!

  • Okapi BM25 for ranking.

The drill-downs on the left also show a number of features from Lucene's facet module:
  • Drill sideways for all fields, so that the field does not disappear when you drill down on it.

  • Dynamic range faceting: the Updated drill-down is computed dynamically, e.g. all issues updated in the past week.

  • Hierarchical fields, which are simple since the Lucene facet module supports hierarchy natively. Only the Component dimension is hierarchical, e.g. look at the Component drill down for all Lucene core issues.

  • Multi-select faceting (hold down the shift key when clicking on a value), e.g. all improvements and new features.

  • Multi-valued fields (e.g. User, Fix version, Label).



This is really eating two different dog foods: first, as a developer I see what one must go through to build a search server on top of Lucene's APIs, but second, as an end user, I experience the resulting search user interface whenever I need to find a Lucene or Solr issue. It's like having to eat both wet and dry dog food at once, and both kinds of dog food have uncovered numerous issues!

The issues ranged from outright bugs such as PostingsHighlighter picking the worst snippets instead of the best (LUCENE-4826), to missing features like dynamic numeric range facets (LUCENE-4965), to issues that make consuming Lucene's APIs awkward, especially when mixing different features, such as the difficulty of mixing non-range and range facets with DrillSideways (LUCENE-4980) and the difficulty of using NRTManager with both a taxonomy index and a search index (LUCENE-4967), or finally just inefficient, such as the inability to customize how PostingsHighlighter loads its field values (LUCENE-4846).

The process is far from done! There are still a number of issues that need fixing. For example, it's not easy to mix ToParentBlockJoinQuery and grouping, which is frustrating because fields like who reported an issue, severity, issue status, component would all be natural group-by fields. Some issues, such as the inability of PostingsFormatter to render directly to a JSONObject (LUCENE-4896) are still open because they are challenging to fix cleanly. Others, such as the infix suggester (LUCENE-4845) are in limbo because of disagreements on the best approach, while still others, like BlendedComparator used to sort by mixed relevance and recency, I just haven't pushed back into Lucene yet.

There are plenty of ways to provoke an error from the server; here are two fun examples: try fielded search such as summary:python, or a multi-select drilldown on the Updated field.

Much work remains and I'll keep on eating both wet and dry dog food: it's a very productive way to find problems in Lucene!

Sunday, February 24, 2013

Drill Sideways faceting with Lucene

Lucene's facet module, as I described previously, provides a powerful implementation of faceted search for Lucene. There's been a lot of progress recently, including awesome performance gains as measured by the nightly performance tests we run for Lucene:



That's a nearly 3.8X speedup!

But beyond speedups there are also new features, and here I'll describe the new DrillSideways class being developed under LUCENE-4748.

In the typical faceted search UI, you'll see a column to the left of your search results showing each field (Price, Manufacturer, etc.) enabled for refinement, displaying the unique values or ranges along with their counts. The count tells you how many matches from your current search you'll see if you drill down on that value or range, by clicking it and adding that filter to your current search. If you go shopping for Internal SSDs at Newegg you'll see something like the drill down on the left.

If you click on one of those values at Newegg, say 129GB - 256GB, you'll notice that the field then disappears from the drill down options and jumps to the top of the page, as part of the search breadcrumb, where the only thing you can do with it is drill up by clicking the (X) to remove the filter. What if you drill down on several fields, but then want to explore alternate values for those fields? You can't really do that: you're stuck. This is a poor user experience, and I often find myself making excessive use of the back button to explore different options on sites that only offer drill down and drill up: the UI is too restrictive. Overstock.com is another example of a pure drill down/up UI.

Other sites improve on this by offering drill sideways, or the option to pick alternative or additional facet values for a field you've previously drilled down on.

For example, try searching for an LED television at Amazon, and look at the Brand field, seen in the image to the right: this is a multi-select UI, allowing you to select more than one value. When you select a value (check the box or click on the value), your search is filtered as expected, but this time the field does not disappear: it stays where it was, allowing you to then drill sideways on additional values. Much better!

LinkedIn's faceted search, seen on the left, takes this even further: not only are all fields drill sideways and multi-select, but there is also a text box at the bottom for you to choose a value not shown in the top facet values.

To recap, a single-select field only allows picking one value at a time for filtering, while a multi-select field allows picking more than one. Separately, drilling down means adding a new filter to your search, reducing the number of matching docs. Drilling up means removing an existing filter from your search, expanding the matching documents. Drilling sideways means changing an existing filter in some way, for example picking a different value to filter on (in the single-select case), or adding another or'd value to filter on (in the multi-select case).

Whether the field offers drill sideways is orthogonal to whether the field is multi-select. For example, look at the date field at Search-lucene.com: it's a single-select field, but allows for drill sideways (unfortunately the counts seem to be buggy!).

Implementing Drill Sideways

How can one implement drill sideways? It's tricky because, once you've drilled into a specific value for a given field, the other values in that field will of course have zero drill down count since they were filtered out of the result set. So how do you compute those other counts?

A straightforward solution is the "hold one out" approach: if the original query is "foo", and you've drilled down on three fields (say A:a, B:b and C:c), then to compute the facet count for each of the fields A, B and C you re-run the query, computing drill down counts, each time with only the other fields filtered.

For example, the facet counts for A are the drill down counts from the query +foo +B:b +C:c (A:a is left out), while the facet counts for B are the drill down counts from the query +foo +A:a +C:c (B:b is left out), etc. You must also separately run the query with all filters (+foo +A:a +B:b +C:c) to get the actual hits. With Solr you explicitly specify this hold-on-out yourself as part of the facet request.

While this approach will produce the correct counts, it's costly because now you're running 4 queries instead of 1. You may be able to speed this up by using bitsets instead: produce a bitset of all matching documents for just the query "foo", then a bitset for each of the drill downs A, B and C, and then intersect the bitsets as above, and count facets for the resulting docs. Solr uses an approach like this, caching the bitsets across requests.

Yet another approach, and the one we took on LUCENE-4748, is to execute a single query that matches both hits as well as near-misses, which are documents that matched all but one of the filters. This is actually a standard BooleanQuery, where the original query is a MUST clause, and each filter is a SHOULD clause, with minNumberShouldMatch set to the number of drill down filters minus 1. Then, when each match is collected, if all filters match, it's a true hit and we collect the document accordingly and increment the drill down counts. But if it's a near-miss, because exactly one filter failed, then we instead increment drill sideways counts against that one field that didn't match.

Note that when you first drill down on a given field, the resulting drill sideways counts for that field will be the same as the drill down counts just before you drilled down. If the UI can save this state then a small optimization would be to avoid counting the drill sideways counts for the just-drilled-down field, and instead reuse the previous drill down counts.

The DrillSideways class is obviously very new, so I'm sure there are some exciting bugs in it (though it does have a good randomized test). Feedback and patches are welcome!

Saturday, January 26, 2013

Getting real-time field values in Lucene

We know Lucene's near-real-time search is very fast: you can easily refresh your searcher once per second, even at high indexing rates, so that any change to the index is available for searching or faceting at most one second later. For most applications this is plenty fast.

But what if you sometimes need even better than near-real-time? What if you need to look up truly live or real-time values, so for any document id you can retrieve the very last value indexed?

Just use the newly committed LiveFieldValues class!

It's simple to use: when you instantiate it you provide it with your SearcherManager or NRTManager, so that it can subscribe to the RefreshListener to be notified when new searchers are opened, and then whenever you add, update or delete a document, you notify the LiveFieldValues instance. Finally, call the get method to get the last indexed value for a given document id.

This class is simple inside: it holds the values of recently indexed documents in a ConcurrentHashMap, keyed by the document id, to hold documents that were just indexed but not yet available through the near-real-time searcher. Whenever a new near-real-time searcher is successfully opened, it clears the map of all entries that are now included in that searcher. It carefully handles the transition time from when the reopen started to when it finished by checking two maps for the possible value, and failing that, it falls back to the current searcher.

LiveFieldValues is abstract: you must subclass it and implement the lookupFromSearcher method to retrieve a document's value from an IndexSearcher, since how your application stores the values in the searcher is application dependent (stored fields, doc values or even postings, payloads or term vectors).

Note that this class only offers "live get", i.e. you can get the last indexed value for any document, but it does not offer "live search", i.e. you cannot search against the value until the searcher is reopened. Also, the internal maps are only pruned after a new searcher is opened, so RAM usage will grow unbounded if you never reopen! It's up to your application to ensure that the same document id is never updated simultaneously (in different threads) because in that case you cannot know which update "won" (Lucene does not expose this information, although LUCENE-3424 is one possible solution for this).

An example use-case is to store a version field per document so that you know the last version indexed for a given id; you can then use this to reject a later but out-of-order update for that same document whose version is older than the version already indexed.

LiveFieldValues will be available in the next Lucene release (4.2).

Thursday, January 10, 2013

Taming Text is released!

There's a new exciting book just published from Manning, with the catchy title Taming Text, by Grant S. Ingersoll (fellow Apache Lucene committer), Thomas S. Morton, and Andrew L. Farris.

I enjoyed the (e-)book: it does a good job covering a truly immense topic that could easily have taken several books. Text processing has become vital for businesses to remain competitive in this digital age, with the amount of online unstructured content growing exponentially with time. Yet, text is also a messy and therefore challenging science: the complexities and nuances of human language don't follow a few simple, easily codified rules and are still not fully understood today.

The book describe search techniques, including tokenization, indexing, suggest and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!

N-gram processing (both character and word ngrams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.

The final chapter, "Untamed Text", is especially fun: the sections, some of which are contributed by additional authors, address very challenging topics like semantics extraction, document summarization, relationship extraction, identifying important content and people, detecting emotions with sentiment analysis and cross-language information retrieval.

There were a few topics I expected to see but seemed to be missing. There was no coverage of the Unicode standard (e.g. encodings, and useful standards such as UAX#29 text segmentation). Multi-lingual issues were not addressed; all examples are English. Finite-state transducers were also missing, even though these are powerful tools for text processing. Lucene uses FSTs in a number of places: efficient synonym-filtering, character filtering during analysis, fast auto-suggest, tokenizing Japanese text, in-memory postings format. Still, it's fine that some topics are missing: text processing is an immense field and something has to be cut!

The book is unfortunately based on Lucene/Solr 3.x, so new features only in Lucene/Solr 4.0 are missing, for example the new DirectSpellChecker, scoring models beyond TF/IDF Vector Space Model. Chapter 4, Fuzzy text searching, didn't mention Lucene's new FuzzyQuery nor the very fast Levenshtein Automata approach it uses for finding all fuzzy matches from a large set of terms.

All in all the book is a great introduction to how to leverage numerous open-source tools to process text.

Sunday, December 30, 2012

A new Lucene highlighter is born

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it's the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google's Chrome browser has an ingenious solution to the final inch problem, when you use "Find..." to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn't store the full token graph.

Unlike the existing highlighters, which rely on term-vectors or on re-analysis of each matched document to obtain the per-token offsets, PostingsHighlighter uses the recently added postings offsets feature. To index postings offsets you must set the field to be highlighted to use FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS option during indexing.

It turns out postings offsets is much more efficient storage for offsets because the default codec (currently Lucene41) does a good job compressing them: ~1.1 byte per position, which includes both start and end offset. In contrast, term vectors require substantially more disk space (~7.8X for the 10 million document English Wikipedia index), slow down indexing and merging, and are slow to access at search time. A smaller index also means the "working set" size, i.e. the net number of bytes that your search application frequently hits from disk, will be smaller, so you'll need less RAM to keep the index hot.

PostingsHighlighter uses a BreakIterator to find passages in the text; by default it breaks using getSentenceIterator. It then iterates in parallel (merge sorting by offset) through the positions of all terms from the query, coalescing those hits that occur in a single passage into a Passage, and then scores each Passage using a separate PassageScorer.

The scoring model is fun: it treats the single original document as the whole corpus, and then scores individual passages as if they were documents in this corpus. The default PassageScorer uses BM25 scoring, biased with a normalization factor that favors passages occurring closer to the start of the document, but it's pluggable so you can implement your own scoring (and feel free to share if you find an improvement!).

This new highlighter should be substantially faster than our existing highlighters on a cold index (when the index doesn't fit entirely into available RAM), as it does more sequential IO instead of seek-heavy random access. Furthermore, as you increase the number of top hits, the performance gains should be even better. Also, the larger the documents the better the performance gains should be.

One known limitation is that it can only highlight a single field at a time, i.e. you cannot pass it N fields and have it pick the best passages across all of them, though both existing highlighters have the same limitation. The code is very new and may still have some exciting bugs! This is why it's located under Lucene's sandbox module.

If you are serious about highlighting in your search application (and you should be!) then PostingsHighlighter is well worth a look!

Sunday, December 9, 2012

Fun with Lucene's faceted search module

These days faceted search and navigation is common and users have come to expect and rely upon it.

Lucene's facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice "getting started" examples in his second post.

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I'm sure there are more...

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.

Lucene's nightly performance benchmarks

I was curious about the performance of faceted search, so I added date facets, indexed as year/month/day hierarchy, to the nightly Lucene benchmarks. Specifically I added faceting to all TermQuerys that were already tested, and now we can watch this graph to track our faceted search performance over time. The date field is the timestamp of the most recent revision of each Wikipedia page.

Simple performance tests

I also ran some simple initial tests on a recent (5/2/2012) English Wikipedia export, which contains 30.2 GB of plain text across 33.3 million documents. By default, faceted search retrieves the counts of all facet values under the root node (years, in this case):
     Date (3994646)
       2012 (1990192)
       2011 (752327)
       2010 (380977)
       2009 (275152)
       2008 (271543)
       2007 (211688)
       2006 (98809)
       2005 (12846)
       2004 (1105)
       2003 (7)
It's interesting that 2012 has such a high count, even though this export only includes the first five months and two days of 2012. Wikipedia's pages are very actively edited!

The search index with facets grew only slightly (~2.3%, from 12.5 GB to 12.8 GB) because of the additional indexed facet field. The taxonomy index, which is a separate index used to map facets to fixed integer codes, was tiny: only 120 KB. The more unique facet values you have, the larger this index will be.

Next I compared search performance with and without faceting. A simple TermQuery (party), matching just over a million hits, was 51.2 queries per second (QPS) without facets and 3.4 QPS with facets. While this is a somewhat scary slowdown, it's the worst case scenario: TermQuery is very cheap to execute, and can easily match a large number of hits. The cost of faceting is in proportion to the number of hits. It would be nice to speed this up (patches welcome!).

I also tested a harder PhraseQuery ("the village"), matching 194 K hits: 3.8 QPS without facets and 2.8 QPS with facets, which is less of a hit because PhraseQuery takes more work to match each hit and generally matches fewer hits.

Loading facet data in RAM

For the above results I used the facet defaults, where the per-document facet values are left on disk during aggregation. If you have enough RAM you can also load all facet values into RAM using the CategoryListCache class. I tested this, and it gave nice speedups: the TermQuery was 73% faster (to 6.0 QPS) and the PhraseQuery was 19% faster.

However, there are downsides: it's time-consuming to initialize (4.6 seconds in my test), and not NRT-friendly, though this shouldn't be so hard to fix (patches welcome!). It also required a substantial 1.9 GB RAM, according to Lucene's RamUsageEstimator. We should be able to reduce this RAM usage by switching to Lucene's fast packed ints implementation from the current int[][] it uses today, or by using DocValues to hold the per-document facet data. I just opened LUCENE-4602 to explore DocValues and initial results look very promising.

Sampling

Next I tried sampling, where the facet module visits 1% of the hits (by default) and only aggregates counts for those. In the default mode, this sampling is used only to find the top N facet values, and then a second pass computes the correct count for each of those values. This is a good fit when the taxonomy is wide and flat, and counts are pretty evenly distributed. I tested that, but results were slower, because the date taxonomy is not wide and flat and has rather lopsided counts (2012 has the majority of hits).

You can also skip the second pass and then present approximate counts or a percentage value to the user. I tested that and saw sizable gains: the TermQuery was 248% (2.5X) faster (to 12.2 QPS) and the PhraseQuery was 29% faster (to 3.6 QPS). The sampling is also quite configurable: you can set the min and max sample sizes, the sample ratio, the threshold under which no sampling should happen, etc.

Lucene's facet module makes it trivial to add facets to your search application, and offers useful features like sampling, alternative aggregates, complements, RAM caching, and fully customizable interfaces for many aspects of faceting. I'm hopeful we can reduce the RAM consumption for caching, and speed up the overall performance, over time.

Monday, November 19, 2012

Lucene with Zing, Part 2

When I last tested Lucene with the Zing JVM the results were impressive: Zing's fully concurrent C4 garbage collector had very low pause times with the full English Wikipedia index (78 GB) loaded into RAMDirectory, which is not an easy feat since we know RAMDirectory is stressful for the garbage collector.

I had used Lucene 4.0.0 alpha for that first test, so I decided to re-test using Lucene's 4.0.0 GA release and, surprisingly, the results changed! MMapDirectory's max throughput was now better than RAMDirectory's (versus being much lower before), and the concurrent mark/sweep collector (-XX:-UseConcMarkSweepGC) was no longer hitting long GC pauses.

This was very interesting! What change could improve MMapDirectory's performance, and lower the pressure on concurrent mark/sweep's GC to the point where pause times were so much lower in GA compared to alpha?

Fortunately, Zing includes a nice tool, gcLogAnalyser, to visualize all sorts of data extracted from its GC logs. Like Zing's JVM, this tool is also free for open-source development. To enable GC logging you should pass -verbose:gc and -XX:+PrintGCDetails to the Zing JVM, and it's also a good idea to redirect all GC output to a separate file using -Xloggc so that any other output isn't mixed in.

Using gcLogAnalyser it was clear that, for my test, Lucene 4.0.0 alpha had a much higher allocation rate compared to 4.0.0 GA, and from that hint I isolated the difference to this sneaky little performance bug (LUCENE-4289). That issue, present in 4.0.0 alpha and now fixed in 4.0.0 GA, didn't break anything but was causing FastVectorHighlighter to do far too many term lookups internally, consuming too much CPU, slowing down the max throughput and generating extra garbage collection pressure in the process.

It was great that Zing handled the higher allocation rate just fine while concurrent mark/sweep hit long pauses and I wanted to see if this same long pause behavior might happen again if I changed the test.

So I constructed a new test, using 4.0.0 GA, that would bring back some of the garbage collector stress by using term queries matching fewer hits (~170K hits, versus up to ~3M hits before). This would mean more time will be spent hiliting, which is somewhat GC intensive, and less time would be spent retrieving hits, which generates very little garbage. I also increased the near-real-time indexing rate from 100 KB/sec to 2 MB/sec, and ran on the same full English Wikipedia index, but with deleted documents included (net index was ~70% larger in bytes, but the same 33.3 million number of live documents). The net index size was 132G and max heap was 240G (the computer has 512G RAM, plenty!). Each test (one data point in the graphs below) ran for one hour, recording the response time of all queries. I discard the first 5 minute's worth of queries to allow for warmup.

I set concurrent mark/sweep's new generation to 1 GB (-XX:NewSize=1g) because if I didn't, it would strangely sometimes pick a tiny (~18 MB) new generation size which would kill its performance (~10X slower throughput).

Again I tried the experimental G1 collector and it had awful (~130 second) GC hangs, so I left it out.

The first surprising result was the max (saturated) throughput was nearly 2X faster with Zing than with concurrent mark/sweep using RAMDirectory; previously the two had nearly the same maximum throughput:

The graph plots the response time of the slowest query against the actual QPS achieved (total number of queries actually answered in the hour). It's also interesting that MMapDirectory gets higher max throughput than RAMDirectory with concurrent mark/sweep; this is likely because concurrent mark/sweep has higher overhead with a large heap. It could also be in my previous test that concurrent mark/sweep was picking too-small size for the new generation (I had left it at its defaults before).

Here's the same graph, but discarding the over-capacity data points so we can see worst case query latencies below the max throughput:

Again concurrent mark/sweep has non-trivial pauses with both MMapDirectory and RAMDirectory, even at low loads, while Zing remains remarkably flat. What's going on?

Concurrent mark/sweep new generation bottleneck

I suspect in this test the new generation collection is a bottleneck, since it's still a stop-the-world collection, and this explains concurrent mark/sweep's slower max throughput when compared to Zing on RAMDirectory. From the logs, I see the new gen collection running 3-4 times per second and typically taking ~125 msec but sometimes up to 400-500 msec.

Unfortunately, if your application has a high allocation rate, you're stuck between a rock and a hard place: if you tune the new generation size smaller, the throughput drops (this test is already much slower throughput than Zing), but if you tune it larger then you have longer stop-the-world pauses. Zing's C4 collector doesn't have this problem because even its new generation collector is fully concurrent, so it gains the efficiency of a larger new generation without suffering longer pause times.

Furthermore, I think Lucene's segment merges create an allocation pattern that exacerbates the new generation bottleneck: suddenly a large number of long-lived objects are created, causing high-latency new generation collections because these objects are not only copied out of the eden space but must also bounce back and forth in the survivor spaces until they are promoted. This "sudden" arrival of a large number and net size of long-lived objects can cause substantial pauses due to new generation collection.

Concurrent mark/sweep old generation fragmentation

The other, more serious failure mode of concurrent mark/sweep is fragmentation of the old generation. When this strikes, you'll see a "promotion failed" followed by a full (compacting) GC in the GC log which can easily take 10s of seconds on a large heap. I wasn't able to provoke this in my limited testing (maybe I didn't run the tests for long enough), but for a long-running application it's inevitable that eventually you'll hit promotion failed. It's a very real issue, and it comes up frequently on the Solr user's list; here's a recent example.

Azul has a nifty free tool, Fragger, that easily provokes fragmentation and full collection, while only using a small (10% by default) part of the heap and allocating at a light (20 MB/sec default) rate. It "wraps" around any other java command-line, and accelerates testing since it fragments the old gen faster than your application normally would. I ran a quick test, adding Fragger to the search test, and sure enough concurrent mark/sweep hit horrible pauses while Zing was unaffected.

In summary, I remain impressed with Zing, and I wish its C4 collector were the default for Java! Then we all would stop having any GC worries and could freely use Java with very large heaps, fundamentally changing how we build software in this age of very cheap RAM. Zing works fine at its defaults, while concurrent mark/sweep requires substantial tuning, and even then will eventually hit very high pause times due to the new generation bottleneck and old generation fragmentation.

By the way, a very useful JVM option is -XX:+PrintGCApplicationStoppedTime. This prints the net time of every stop-the-world event, like this:
  Total time for which application threads were stopped: 0.4178300 seconds
You can use this to see which stop-the-world phase of your garbage collector is causing your application's hangs.

Friday, September 28, 2012

Lucene's new analyzing suggester

Live suggestions as you type into a search box, sometimes called suggest or autocomplete, is now a standard, essential search feature ever since Google set a high bar after going live just over four years ago.

In Lucene we have several different suggest implementations, under the suggest module; today I'm describing the new AnalyzingSuggester (to be committed soon; it should be available in 4.1).

To use it, you provide the set of suggest targets, which is the full set of strings and weights that may be suggested. The targets can come from anywhere; typically you'd process your query logs to create the targets, giving a higher weight to those queries that appear more frequently. If you sell movies you might use all movie titles with a weight according to sales popularity.

You also provide an analyzer, which is used to process each target into analyzed form. Under the hood, the analyzed form is indexed into an FST. At lookup time, the incoming query is processed by the same analyzer and the FST is searched for all completions sharing the analyzed form as a prefix.

Even though the matching is performed on the analyzed form, what's suggested is the original target (i.e., the unanalyzed input). Because Lucene has such a rich set of analyzer components, this can be used to create some useful suggesters:
  • With an analyzer that folds or normalizes case, accents, etc. (e.g., using ICUFoldingFilter), the suggestions will match irrespective of case and accents. For example, the query "ame..." would suggest Amélie.

  • With an analyzer that removes stopwords and normalizes case, the query "ghost..." would suggest "The Ghost of Christmas Past".

  • Even graph TokenStreams, such as SynonymFilter, will work: in such cases we enumerate and index all analyzed paths into the FST. If the analyzer recognizes "wifi" and "wireless network" as synonyms, and you have the suggest target "wifi router" then the user query "wire..." would suggest "wifi router".

  • Japanese suggesters may now be possible, with an analyzer that copies the reading (ReadingAttribute in the Kuromoji analyzer) as its output.

Given the diversity of analyzers, and the easy extensibility for applications to create their own analyzers, I'm sure there are many interesting use cases for this new AnalyzingSuggester: if you have an example please share with us on Lucene's user list (java-user@lucene.apache.org).

While this is a great step forward, there's still plenty to do with Lucene's suggesters. We need to allow for fuzzy matching on the query so we're more robust to typos (there's a rough prototype patch on LUCENE-3846). We need to predict based on only part of the query, instead of insisting on a full prefix match. There are a number of interesting elements to Google's autosuggest that we could draw inspiration from. As always, patches welcome!

Tuesday, August 21, 2012

Lucene's new BlockPostingsFormat, thanks to Google Summer of Code

Another summer wraps up, another successful Google Summer of Code project finished in Apache Lucene!

We had two projects this summer; here I'll describe Han Jiang's project (I was his mentor). The second project, separating StorableField and IndexableField, is also very exciting and wrapping up now.

The goal of Han's project was to create a compelling replacement int-block postings format, improving on the aging variable-int postings format that Lucene has used forever. The project was definitely a success! This morning I merged over the feature branch we had been using all summer, committing to both Lucene's trunk (future 5.x) and Lucene 4.x branch (to be 4.0 GA). This means Lucene 4.0 GA will have the new BlockPostingsFormat available.

The new postings format shows incredible search performance gains versus the current default postings format. I tested with luceneutil on the full May 2012 English Wikipedia index (33.3 million documents, 15 GB index):

TaskQPS baseStdDev baseQPS packedStdDev packed% change
AndHighLow353.099.16317.062.5313%-7%
Fuzzy253.081.0249.583.4114%-1%
Respell41.230.5639.671.378%-0%
PKLookup169.411.94166.691.573%-0%
Fuzzy133.070.8235.240.931%-12%
IntNRQ3.820.014.090.280%-14%
MedSpanNear1.460.031.600.035%-13%
HighSpanNear0.550.010.620.019%-17%
LowSpanNear2.650.063.080.0312%-19%
HighSloppyPhrase0.630.020.740.029%-23%
OrHighHigh3.040.173.550.350%-35%
HighPhrase0.650.000.760.049%-24%
OrHighMed6.620.397.750.800%-37%
LowSloppyPhrase2.340.042.780.0514%-23%
Wildcard17.200.2020.420.7612%-24%
OrHighLow7.290.448.670.920%-39%
AndHighMed23.110.6127.550.3414%-23%
LowPhrase7.440.109.010.4413%-28%
Prefix324.650.3529.941.2314%-28%
MedPhrase3.470.044.240.1815%-28%
MedSloppyPhrase2.030.042.510.0419%-28%
LowTerm145.810.36182.142.0223%-26%
MedTerm51.640.2565.571.1424%-29%
HighTerm8.490.0310.940.2325%-32%
AndHighHigh6.520.198.700.1926%-40%


The gains are awesome! After this change has baked for a while I expect we'll make it the default postings format, maybe in time for 4.1.

This effort has been a long time coming: ever since the creation of flexible indexing, allowing for full pluggability of all files in the index, we have wanted to improve on Lucene's postings format. Early int-block prototypes showed great performance.

The new format stores postings data in fixed blocks of 128 ints. Each block is encoded using a simple packed ints format. The .doc file interleaves docID delta and term frequency blocks, and also includes skip data, while the .pos file holds position deltas. Offsets and payloads, if present, are stored in the .pay file. On the full English Wikipedia index used in the above test this resulted in a nice reduction of disk usage: 13.7 GB (new) vs 14.7 GB (current). That's ~7% smaller.

The format only uses packed int blocks for high-frequency terms: low frequency terms (that appear in less than 128 documents), and the final partial-block of a high frequency term, are stored in the same variable-int format as today's default. Skip points are stored at the start of each block (except the first block).

Han built and tested a number of fun ideas, including skipping within blocks (with partial block decoding), using PForDelta instead of simple packed ints, wrapping with PulsingPostingsFormat, all sorts of different ways to decode packed ints, different values for blockSize and skipMultiplier, encoding all-single-values very compactly, etc. We spent lots of electricity running all sorts of benchmarks. Some ideas worked and some didn't, and things did not go according to the original plan! Such is the healthy unpredictable, zig-zag, iterative nature of open-source development/exploration, yet in the end we find ourselves with a great new postings format.

Thank you Han Jiang for such an important improvement to Lucene, and thank you Google for sponsoring this effort through Google Summer of Code.

Tuesday, July 31, 2012

Lucene index in RAM with Azul's Zing JVM

Google's entire index has been in RAM for at least 5 years now. Why not do the same with an Apache Lucene search index?

RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.

The obvious approach is to load the index into Lucene's RAMDirectory, right?

Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a List of byte[1024] fragments (there are open Jira issues to address this but they haven't been committed yet). It also has unnecessary synchronization. If the application is updating the index (not just searching), another challenge is how to persist ongoing changes from RAMDirectory back to disk. Startup is much slower as the index must first be loaded into RAM. Given these problems, Lucene developers generally recommend using RAMDirectory only for small indices or for testing purposes, and otherwise trusting the operating system to manage RAM by using MMapDirectory (see Uwe's excellent post for more details).

While there are open issues to improve RAMDirectory (LUCENE-4123 and LUCENE-3659), they haven't been committed and many users simply use RAMDirectory anyway.

Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of RAMDirectory should not be a problem for Zing. Let's test it! But first, a quick digression on the importance of measuring search response time of all requests.

Search response time percentiles

Normally our performance testing scripts (luceneutil) measure the average response time, discarding outliers. We do this because we are typically testing an algorithmic change and want to see the effect of that change, ignoring confounding effects due to unrelated pauses from the the OS, IO systems, or GC, etc.

But for a real search application what matters is the total response time of every search request. Search is a fundamentally interactive process: the user sits and waits for the results to come back and then iterates from there. If even 1% of the searches take too long to respond that's a serious problem! Users are impatient and will quickly move on to your competitors.

So I made some improvements to luceneutil, including separating out a load testing client (sendTasks.py) that records the response time of all queries, as well as scripts (loadGraph.py and responseTimeGraph.py) to generate the resulting response-time graphs, and a top-level responseTimeTests.py to run a series of tests at increasing loads (queries/sec), automatically stopping once the load is clearly beyond the server's total capacity. As a nice side effect, this also determines the true capacity (max throughput) of the server rather than requiring an approximate measure by extrapolating from the average search latency.

Queries are sent according to a Poisson distribution, to better model the arrival times of real searches, and the client is thread-less so that if you are testing at 200 queries/sec and the server suddenly pauses for 5 seconds then there will be 1000 queries queued up once it wakes up again (this fixes an unfortunately common bug in load testers that dedicate one thread per simulated client).

The client can run (via password-less ssh) on a separate machine; this is important because if the server machine itself (not just the JVM) is experiencing system-wide pauses (e.g. due to heavy swapping) it can cause pauses in the load testing client which will skew the results. Ideally the client runs on an otherwise idle machine and experiences no pauses. The client even disables Python's cyclic garbage collector to further reduce the chance of pauses.

Wikipedia tests

To test Zing, I first indexed the full Wikipedia English database (as of 5/2/2012), totalling 28.78 GB plain text across 33.3 M 1 KB sized documents, including stored fields and term vectors, so I could highlight the hits using FastVectorHighlighter. The resulting index was 78 GB. For each test, the server loads the entire index into RAMDirectory and then the client sends the top 500 hardest (highest document frequency) TermQuerys, including stop words, to the server. At the same time, the server re-indexes documents (updateDocument) at the rate of 100/sec (~100 KB/sec), and reopens the searcher once per second.

Each test ran for an hour, discarding results for the first 5 minutes to allow for warmup. Max heap is 140 GB (-Xmx 140G). I also tested MMapDirectory, with max heap of 4 GB, as a baseline. The machine has 32 cores (64 with hyper-threading) and 512 GB of RAM, and the server ran with 40 threads.

I tested at varying load rates (queries/sec), from 50 (very easy) to 275 (too hard), and then plotted the resulting response time percentiles for different configurations. The default Oracle GC (Parallel) was clearly horribly slow (10s of seconds collection time) so I didn't include it. The experimental garbage first (G1) collector was even slower starting up (took 6 hours to load the index into RAMDirectory, vs 900 seconds for Concurrent Mark/Sweep (CMS)), and then the queries hit > 100 second latencies, so I also left it out (this was surprising as G1 is targeted towards large heaps). The three configurations I did test were CMS at its defaults settings, with MMapDirectory as the baseline, and both CMS and Zing with RAMDirectory.

At the lightest load (50 QPS), Zing does a good job maintaining low worst-case response time, while CMS shows long worst case response times, even with MMapDirectory:



To see the net capacity of each configuration, I plotted the 99% response time, across different load rates:



From this it's clear that the peak throughput for CMS + MMap was somewhere between 100 and 150 queries/sec, while the RAMDirectory based indices were somewhere between 225 and 250 queries/second. This is an impressive performance gain! It's also interesting because in separately testing RAMDirectory vs MMapDirectory I usually only see minor gains when measuring average query latency.

Plotting the same graph, without CMS + MMapDirectory and removing the 275 queries/second point (since it's over-capacity):



Zing remains incredibly flat at the 99% percentile, while CMS has high response times already at 100 QPS. At 225 queries/sec load, the highest near-capacity rate, for just CMS and Zing on RAMDirectory:



The pause times for CMS are worse than they were at 50 QPS: already at the 95% percentile the response times are too slow (4479 milli-seconds).

Zing works!

It's clear from these tests that Zing really has a very low-pause garbage collector, even at high loads, while managing a 140 GB max heap with 78 GB Lucene index loaded into RAMDirectory. Furthermore, applications can expect a substantial increase in max throughput (around 2X faster in this case) and need not fear using RAMDirectory even for large indices, if they can run under Zing.

Note that Azul has just made the Zing JVM freely available to open-source developers, so now we all can run our own experiments, add Zing into the JVM rotation for our builds, etc.

Next I will test the new DirectPostingsFormat which holds all postings in RAM in simple arrays (no compression) for fast search performance. It requires even more RAM than this test, but gives even faster search performance!

Thank you to Azul for providing a beefy server machine and the Zing JVM to run these tests!

Sunday, July 29, 2012

Building a new Lucene postings format

As of 4.0 Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene's index formats.

A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.

The trickiest format to create is PostingsFormat, which provides read/write access to all postings (fields, terms, documents, frequencies, positions, offsets, payloads). Part of the challenge is that it has a large API surface area. But there are also complexities such as skipping, reuse, conditional use of different values in the enumeration (frequencies, positions, payloads, offsets), partial consumption of the enumeration, etc. These challenges unfortunately make it easy for bugs to sneak in, but an awesome way to ferret out all the bugs is to leverage Lucene's extensive randomized tests: run all tests with -Dtests.postingsformat=XXX (be sure to first register your new postings format). If your new postings format has a bug, tests will most likely fail.

However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others. My favorite is the innocently named TestBasics! Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net our tests are not a great fit for the early iterations when developing a new postings format.

I recently created a new postings format, BlockPostingsFormat, which will hopefully be more efficient than the Sep codec at using fixed int block encodings. I did this to support Han Jiang's Google Summer of Code project to add a useful int block postings format to Lucene.

So, I took the opportunity to address this problem of easier early-stage iterations while developing a new postings format by creating a new test, TestPostingsFormat. It has layers of testing (documents, +freqs, +positions, +payloads, +offsets) that you can incrementally enable as you iterate, as well as different test options (skipping or not, reuse or not, stop visiting documents and/or positions early, one or more threads, etc.). When you turn on verbose (-Dtests.verbose=true) the test prints clear details of everything it indexed and what exactly it's testing so a failure is easy to debug. I'm very happy with the results: I found this to be a much more productive way to create a new postings format.

The goal of this test is to be so thorough that if it passes with your posting format then all Lucene's tests should pass. If ever we find that's not the case then I consider that a bug in TestPostingsFormat! (Who tests the tester?)

If you find yourself creating a new postings format I strongly suggest using the new TestPostingsFormat during early development to get your postings format off the ground. Once it's passing, run all tests with your new postings format, and if something fails please let us know so we can fix TestPostingsFormat.