Sunday, February 24, 2013

Drill Sideways faceting with Lucene

Lucene's facet module, as I described previously, provides a powerful implementation of faceted search for Lucene. There's been a lot of progress recently, including awesome performance gains as measured by the nightly performance tests we run for Lucene:



That's a nearly 3.8X speedup!

But beyond speedups there are also new features, and here I'll describe the new DrillSideways class being developed under LUCENE-4748.

In the typical faceted search UI, you'll see a column to the left of your search results showing each field (Price, Manufacturer, etc.) enabled for refinement, displaying the unique values or ranges along with their counts. The count tells you how many matches from your current search you'll see if you drill down on that value or range, by clicking it and adding that filter to your current search. If you go shopping for Internal SSDs at Newegg you'll see something like the drill down on the left.

If you click on one of those values at Newegg, say 129GB - 256GB, you'll notice that the field then disappears from the drill down options and jumps to the top of the page, as part of the search breadcrumb, where the only thing you can do with it is drill up by clicking the (X) to remove the filter. What if you drill down on several fields, but then want to explore alternate values for those fields? You can't really do that: you're stuck. This is a poor user experience, and I often find myself making excessive use of the back button to explore different options on sites that only offer drill down and drill up: the UI is too restrictive. Overstock.com is another example of a pure drill down/up UI.

Other sites improve on this by offering drill sideways, or the option to pick alternative or additional facet values for a field you've previously drilled down on.

For example, try searching for an LED television at Amazon, and look at the Brand field, seen in the image to the right: this is a multi-select UI, allowing you to select more than one value. When you select a value (check the box or click on the value), your search is filtered as expected, but this time the field does not disappear: it stays where it was, allowing you to then drill sideways on additional values. Much better!

LinkedIn's faceted search, seen on the left, takes this even further: not only are all fields drill sideways and multi-select, but there is also a text box at the bottom for you to choose a value not shown in the top facet values.

To recap, a single-select field only allows picking one value at a time for filtering, while a multi-select field allows picking more than one. Separately, drilling down means adding a new filter to your search, reducing the number of matching docs. Drilling up means removing an existing filter from your search, expanding the matching documents. Drilling sideways means changing an existing filter in some way, for example picking a different value to filter on (in the single-select case), or adding another or'd value to filter on (in the multi-select case).

Whether the field offers drill sideways is orthogonal to whether the field is multi-select. For example, look at the date field at Search-lucene.com: it's a single-select field, but allows for drill sideways (unfortunately the counts seem to be buggy!).

Implementing Drill Sideways

How can one implement drill sideways? It's tricky because, once you've drilled into a specific value for a given field, the other values in that field will of course have zero drill down count since they were filtered out of the result set. So how do you compute those other counts?

A straightforward solution is the "hold one out" approach: if the original query is "foo", and you've drilled down on three fields (say A:a, B:b and C:c), then to compute the facet count for each of the fields A, B and C you re-run the query, computing drill down counts, each time with only the other fields filtered.

For example, the facet counts for A are the drill down counts from the query +foo +B:b +C:c (A:a is left out), while the facet counts for B are the drill down counts from the query +foo +A:a +C:c (B:b is left out), etc. You must also separately run the query with all filters (+foo +A:a +B:b +C:c) to get the actual hits. With Solr you explicitly specify this hold-one-out yourself as part of the facet request.

While this approach will produce the correct counts, it's costly because now you're running 4 queries instead of 1. You may be able to speed this up by using bitsets instead: produce a bitset of all matching documents for just the query "foo", then a bitset for each of the drill downs A, B and C, and then intersect the bitsets as above, and count facets for the resulting docs. Solr uses an approach like this, caching the bitsets across requests.

Yet another approach, and the one we took on LUCENE-4748, is to execute a single query that matches both hits as well as near-misses, which are documents that matched all but one of the filters. This is actually a standard BooleanQuery, where the original query is a MUST clause, and each filter is a SHOULD clause, with minimumNumberShouldMatch set to the number of drill down filters minus 1. Then, when each match is collected, if all filters match, it's a true hit and we collect the document accordingly and increment the drill down counts. But if it's a near-miss, because exactly one filter failed, then we instead increment drill sideways counts against that one field that didn't match.
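
Concretely, for the example above (query foo with drill downs A:a, B:b and C:c), the single combined query could be built something like this rough sketch using plain Lucene 4.x classes; DrillSideways does this wiring, plus the hit/near-miss accounting, for you:

  // the original query is a MUST clause; each drill-down filter is a SHOULD clause
  BooleanQuery query = new BooleanQuery();
  query.add(baseQuery, BooleanClause.Occur.MUST);                            // "foo"
  query.add(new TermQuery(new Term("A", "a")), BooleanClause.Occur.SHOULD);
  query.add(new TermQuery(new Term("B", "b")), BooleanClause.Occur.SHOULD);
  query.add(new TermQuery(new Term("C", "c")), BooleanClause.Occur.SHOULD);

  // with 3 drill-down filters, requiring at least 2 to match lets through both
  // true hits (all 3 match) and near-misses (exactly one filter fails)
  query.setMinimumNumberShouldMatch(2);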

Note that when you first drill down on a given field, the resulting drill sideways counts for that field will be the same as the drill down counts just before you drilled down. If the UI can save this state then a small optimization would be to avoid counting the drill sideways counts for the just-drilled-down field, and instead reuse the previous drill down counts.

The DrillSideways class is obviously very new, so I'm sure there are some exciting bugs in it (though it does have a good randomized test). Feedback and patches are welcome!

Saturday, January 26, 2013

Getting real-time field values in Lucene

We know Lucene's near-real-time search is very fast: you can easily refresh your searcher once per second, even at high indexing rates, so that any change to the index is available for searching or faceting at most one second later. For most applications this is plenty fast.

But what if you sometimes need even better than near-real-time? What if you need to look up truly live or real-time values, so for any document id you can retrieve the very last value indexed?

Just use the newly committed LiveFieldValues class!

It's simple to use: when you instantiate it you provide it with your SearcherManager or NRTManager, so that it can subscribe to the RefreshListener to be notified when new searchers are opened, and then whenever you add, update or delete a document, you notify the LiveFieldValues instance. Finally, call the get method to get the last indexed value for a given document id.

This class is simple inside: it holds the values of recently indexed documents in a ConcurrentHashMap, keyed by the document id, covering documents that were just indexed but are not yet visible through the near-real-time searcher. Whenever a new near-real-time searcher is successfully opened, it clears the map of all entries that are now included in that searcher. It carefully handles the transition time from when the reopen started to when it finished by checking two maps for the possible value, and failing that, it falls back to the current searcher.

LiveFieldValues is abstract: you must subclass it and implement the lookupFromSearcher method to retrieve a document's value from an IndexSearcher, since how your application stores the values in the searcher is application dependent (stored fields, doc values or even postings, payloads or term vectors).
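
Here is a rough sketch of such a subclass, tracking a per-document version held in a stored field; the field names are made up, and the generics and signatures shown are my approximation of the newly committed API, so check the actual 4.2 javadocs:

  // assumes each document has an indexed "id" field and a stored numeric "version" field
  class LiveVersionValues extends LiveFieldValues<IndexSearcher, Long> {

    LiveVersionValues(SearcherManager mgr) {
      super(mgr, -1L);   // -1L is the "missing" value returned for unknown ids
    }

    @Override
    protected Long lookupFromSearcher(IndexSearcher s, String id) throws IOException {
      // not in the recent-updates maps: load it from the (near-real-time) searcher
      TopDocs hits = s.search(new TermQuery(new Term("id", id)), 1);
      if (hits.totalHits == 0) {
        return null;
      }
      Document doc = s.doc(hits.scoreDocs[0].doc);
      return doc.getField("version").numericValue().longValue();
    }
  }

  // usage: call add(id, version) right after each indexWriter.updateDocument,
  // delete(id) after deletions, and get(id) whenever you need the latest value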

Note that this class only offers "live get", i.e. you can get the last indexed value for any document, but it does not offer "live search", i.e. you cannot search against the value until the searcher is reopened. Also, the internal maps are only pruned after a new searcher is opened, so RAM usage will grow unbounded if you never reopen! It's up to your application to ensure that the same document id is never updated simultaneously (in different threads) because in that case you cannot know which update "won" (Lucene does not expose this information, although LUCENE-3424 is one possible solution for this).

An example use-case is to store a version field per document so that you know the last version indexed for a given id; you can then use this to reject a later but out-of-order update for that same document whose version is older than the version already indexed.

LiveFieldValues will be available in the next Lucene release (4.2).

Thursday, January 10, 2013

Taming Text is released!

There's a new exciting book just published from Manning, with the catchy title Taming Text, by Grant S. Ingersoll (fellow Apache Lucene committer), Thomas S. Morton, and Andrew L. Farris.

I enjoyed the (e-)book: it does a good job covering a truly immense topic that could easily have taken several books. Text processing has become vital for businesses to remain competitive in this digital age, with the amount of online unstructured content growing exponentially with time. Yet, text is also a messy and therefore challenging science: the complexities and nuances of human language don't follow a few simple, easily codified rules and are still not fully understood today.

The book describes search techniques, including tokenization, indexing, suggest and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!

N-gram processing (both character and word ngrams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.

The final chapter, "Untamed Text", is especially fun: the sections, some of which are contributed by additional authors, address very challenging topics like semantics extraction, document summarization, relationship extraction, identifying important content and people, detecting emotions with sentiment analysis and cross-language information retrieval.

There were a few topics I expected to see but seemed to be missing. There was no coverage of the Unicode standard (e.g. encodings, and useful standards such as UAX#29 text segmentation). Multi-lingual issues were not addressed; all examples are English. Finite-state transducers were also missing, even though these are powerful tools for text processing. Lucene uses FSTs in a number of places: efficient synonym-filtering, character filtering during analysis, fast auto-suggest, tokenizing Japanese text, in-memory postings format. Still, it's fine that some topics are missing: text processing is an immense field and something has to be cut!

The book is unfortunately based on Lucene/Solr 3.x, so features that arrived only in Lucene/Solr 4.0 are missing, for example the new DirectSpellChecker and the scoring models beyond the TF/IDF vector space model. Chapter 4, on fuzzy text searching, didn't mention Lucene's new FuzzyQuery nor the very fast Levenshtein Automata approach it uses for finding all fuzzy matches from a large set of terms.

All in all the book is a great introduction to how to leverage numerous open-source tools to process text.

Sunday, December 30, 2012

A new Lucene highlighter is born

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it's the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content depends on its mime-type (e.g., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google's Chrome browser has an ingenious solution to the final inch problem, when you use "Find..." to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn't store the full token graph.

Unlike the existing highlighters, which rely on term-vectors or on re-analysis of each matched document to obtain the per-token offsets, PostingsHighlighter uses the recently added postings offsets feature. To index postings offsets, index the field to be highlighted with the FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS option.
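
For example, a rough sketch with the 4.1-era APIs (the "body" field name is just an example):

  // at index time: store the text and put offsets into the postings
  FieldType bodyType = new FieldType(TextField.TYPE_STORED);
  bodyType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
  doc.add(new Field("body", text, bodyType));

  // at search time: highlight the top hits without term vectors or re-analysis
  TopDocs hits = searcher.search(query, 10);
  PostingsHighlighter highlighter = new PostingsHighlighter();
  String[] snippets = highlighter.highlight("body", query, searcher, hits);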

It turns out postings offsets are a much more efficient way to store offsets because the default codec (currently Lucene41) does a good job compressing them: ~1.1 bytes per position, which includes both start and end offset. In contrast, term vectors require substantially more disk space (~7.8X for the 10 million document English Wikipedia index), slow down indexing and merging, and are slow to access at search time. A smaller index also means the "working set" size, i.e. the net number of bytes that your search application frequently hits from disk, will be smaller, so you'll need less RAM to keep the index hot.

PostingsHighlighter uses a BreakIterator to find passages in the text; by default it breaks using getSentenceInstance. It then iterates in parallel (merge sorting by offset) through the positions of all terms from the query, coalescing those hits that occur in a single passage into a Passage, and then scores each Passage using a separate PassageScorer.

The scoring model is fun: it treats the single original document as the whole corpus, and then scores individual passages as if they were documents in this corpus. The default PassageScorer uses BM25 scoring, biased with a normalization factor that favors passages occurring closer to the start of the document, but it's pluggable so you can implement your own scoring (and feel free to share if you find an improvement!).

This new highlighter should be substantially faster than our existing highlighters on a cold index (when the index doesn't fit entirely into available RAM), as it does more sequential IO instead of seek-heavy random access. Furthermore, as you increase the number of top hits, the performance gains should be even better. Also, the larger the documents the better the performance gains should be.

One known limitation is that it can only highlight a single field at a time, i.e. you cannot pass it N fields and have it pick the best passages across all of them, though both existing highlighters have the same limitation. The code is very new and may still have some exciting bugs! This is why it's located under Lucene's sandbox module.

If you are serious about highlighting in your search application (and you should be!) then PostingsHighlighter is well worth a look!

Sunday, December 9, 2012

Fun with Lucene's faceted search module

These days faceted search and navigation is common and users have come to expect and rely upon it.

Lucene's facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice "getting started" examples in his second post.

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I'm sure there are more...

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.
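
As a rough sketch of what using the module looks like (the 3.4/4.0-era facet APIs, which have been shifting from release to release, so treat the exact class names and constructors as approximate), you index each document's category paths alongside a separate taxonomy index, then collect counts at search time:

  // index time: add a Date/year/month/day category path to each document
  TaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
  List<CategoryPath> paths = new ArrayList<CategoryPath>();
  paths.add(new CategoryPath("Date", "2012", "05", "02"));
  CategoryDocumentBuilder docBuilder = new CategoryDocumentBuilder(taxoWriter);
  docBuilder.setCategoryPaths(paths);
  docBuilder.build(doc);
  indexWriter.addDocument(doc);

  // search time: count the top 10 values under the Date root for the query's hits
  FacetSearchParams fsp = new FacetSearchParams();
  fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Date"), 10));
  FacetsCollector fc = new FacetsCollector(fsp, indexReader, taxoReader);
  searcher.search(query, fc);
  List<FacetResult> facetResults = fc.getFacetResults();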

Lucene's nightly performance benchmarks

I was curious about the performance of faceted search, so I added date facets, indexed as year/month/day hierarchy, to the nightly Lucene benchmarks. Specifically I added faceting to all TermQuerys that were already tested, and now we can watch this graph to track our faceted search performance over time. The date field is the timestamp of the most recent revision of each Wikipedia page.

Simple performance tests

I also ran some simple initial tests on a recent (5/2/2012) English Wikipedia export, which contains 30.2 GB of plain text across 33.3 million documents. By default, faceted search retrieves the counts of all facet values under the root node (years, in this case):
     Date (3994646)
       2012 (1990192)
       2011 (752327)
       2010 (380977)
       2009 (275152)
       2008 (271543)
       2007 (211688)
       2006 (98809)
       2005 (12846)
       2004 (1105)
       2003 (7)
It's interesting that 2012 has such a high count, even though this export only includes the first four months and two days of 2012. Wikipedia's pages are very actively edited!

The search index with facets grew only slightly (~2.3%, from 12.5 GB to 12.8 GB) because of the additional indexed facet field. The taxonomy index, which is a separate index used to map facets to fixed integer codes, was tiny: only 120 KB. The more unique facet values you have, the larger this index will be.

Next I compared search performance with and without faceting. A simple TermQuery (party), matching just over a million hits, was 51.2 queries per second (QPS) without facets and 3.4 QPS with facets. While this is a somewhat scary slowdown, it's the worst case scenario: TermQuery is very cheap to execute, and can easily match a large number of hits. The cost of faceting is in proportion to the number of hits. It would be nice to speed this up (patches welcome!).

I also tested a harder PhraseQuery ("the village"), matching 194 K hits: 3.8 QPS without facets and 2.8 QPS with facets, which is less of a hit because PhraseQuery takes more work to match each hit and generally matches fewer hits.

Loading facet data in RAM

For the above results I used the facet defaults, where the per-document facet values are left on disk during aggregation. If you have enough RAM you can also load all facet values into RAM using the CategoryListCache class. I tested this, and it gave nice speedups: the TermQuery was 73% faster (to 6.0 QPS) and the PhraseQuery was 19% faster.

However, there are downsides: it's time-consuming to initialize (4.6 seconds in my test), and not NRT-friendly, though this shouldn't be so hard to fix (patches welcome!). It also required a substantial 1.9 GB RAM, according to Lucene's RamUsageEstimator. We should be able to reduce this RAM usage by switching to Lucene's fast packed ints implementation from the current int[][] it uses today, or by using DocValues to hold the per-document facet data. I just opened LUCENE-4602 to explore DocValues and initial results look very promising.

Sampling

Next I tried sampling, where the facet module visits 1% of the hits (by default) and only aggregates counts for those. In the default mode, this sampling is used only to find the top N facet values, and then a second pass computes the correct count for each of those values. This is a good fit when the taxonomy is wide and flat, and counts are pretty evenly distributed. I tested that, but results were slower, because the date taxonomy is not wide and flat and has rather lopsided counts (2012 has the majority of hits).

You can also skip the second pass and then present approximate counts or a percentage value to the user. I tested that and saw sizable gains: the TermQuery was 248% (3.5X) faster (to 12.2 QPS) and the PhraseQuery was 29% faster (to 3.6 QPS). The sampling is also quite configurable: you can set the min and max sample sizes, the sample ratio, the threshold under which no sampling should happen, etc.

Lucene's facet module makes it trivial to add facets to your search application, and offers useful features like sampling, alternative aggregates, complements, RAM caching, and fully customizable interfaces for many aspects of faceting. I'm hopeful we can reduce the RAM consumption for caching, and speed up the overall performance, over time.

Monday, November 19, 2012

Lucene with Zing, Part 2

When I last tested Lucene with the Zing JVM the results were impressive: Zing's fully concurrent C4 garbage collector had very low pause times with the full English Wikipedia index (78 GB) loaded into RAMDirectory, which is not an easy feat since we know RAMDirectory is stressful for the garbage collector.

I had used Lucene 4.0.0 alpha for that first test, so I decided to re-test using Lucene's 4.0.0 GA release and, surprisingly, the results changed! MMapDirectory's max throughput was now better than RAMDirectory's (versus being much lower before), and the concurrent mark/sweep collector (-XX:+UseConcMarkSweepGC) was no longer hitting long GC pauses.

This was very interesting! What change could improve MMapDirectory's performance, and lower the pressure on concurrent mark/sweep's GC to the point where pause times were so much lower in GA compared to alpha?

Fortunately, Zing includes a nice tool, gcLogAnalyser, to visualize all sorts of data extracted from its GC logs. Like Zing's JVM, this tool is also free for open-source development. To enable GC logging you should pass -verbose:gc and -XX:+PrintGCDetails to the Zing JVM, and it's also a good idea to redirect all GC output to a separate file using -Xloggc so that any other output isn't mixed in.

Using gcLogAnalyser it was clear that, for my test, Lucene 4.0.0 alpha had a much higher allocation rate compared to 4.0.0 GA, and from that hint I isolated the difference to this sneaky little performance bug (LUCENE-4289). That issue, present in 4.0.0 alpha and now fixed in 4.0.0 GA, didn't break anything but was causing FastVectorHighlighter to do far too many term lookups internally, consuming too much CPU, slowing down the max throughput and generating extra garbage collection pressure in the process.

It was great that Zing handled the higher allocation rate just fine while concurrent mark/sweep hit long pauses, and I wanted to see whether this same long-pause behavior might happen again if I changed the test.

So I constructed a new test, using 4.0.0 GA, that would bring back some of the garbage collector stress by using term queries matching fewer hits (~170K hits, versus up to ~3M hits before). This means more time is spent highlighting, which is somewhat GC intensive, and less time is spent retrieving hits, which generates very little garbage. I also increased the near-real-time indexing rate from 100 KB/sec to 2 MB/sec, and ran on the same full English Wikipedia index, but with deleted documents included (the net index was ~70% larger in bytes, but held the same 33.3 million live documents). The net index size was 132 GB and max heap was 240 GB (the computer has 512 GB RAM, plenty!). Each test (one data point in the graphs below) ran for one hour, recording the response time of all queries. I discarded the first 5 minutes' worth of queries to allow for warmup.

I set concurrent mark/sweep's new generation to 1 GB (-XX:NewSize=1g) because if I didn't, it would strangely sometimes pick a tiny (~18 MB) new generation size which would kill its performance (~10X slower throughput).

Again I tried the experimental G1 collector and it had awful (~130 second) GC hangs, so I left it out.

The first surprising result was the max (saturated) throughput was nearly 2X faster with Zing than with concurrent mark/sweep using RAMDirectory; previously the two had nearly the same maximum throughput:

The graph plots the response time of the slowest query against the actual QPS achieved (total number of queries actually answered in the hour). It's also interesting that MMapDirectory gets higher max throughput than RAMDirectory with concurrent mark/sweep; this is likely because concurrent mark/sweep has higher overhead with a large heap. It could also be in my previous test that concurrent mark/sweep was picking too-small size for the new generation (I had left it at its defaults before).

Here's the same graph, but discarding the over-capacity data points so we can see worst case query latencies below the max throughput:

Again concurrent mark/sweep has non-trivial pauses with both MMapDirectory and RAMDirectory, even at low loads, while Zing remains remarkably flat. What's going on?

Concurrent mark/sweep new generation bottleneck

I suspect in this test the new generation collection is a bottleneck, since it's still a stop-the-world collection, and this explains concurrent mark/sweep's slower max throughput when compared to Zing on RAMDirectory. From the logs, I see the new gen collection running 3-4 times per second and typically taking ~125 msec but sometimes up to 400-500 msec.

Unfortunately, if your application has a high allocation rate, you're stuck between a rock and a hard place: if you tune the new generation size smaller, the throughput drops (this test is already much slower throughput than Zing), but if you tune it larger then you have longer stop-the-world pauses. Zing's C4 collector doesn't have this problem because even its new generation collector is fully concurrent, so it gains the efficiency of a larger new generation without suffering longer pause times.

Furthermore, I think Lucene's segment merges create an allocation pattern that exacerbates the new generation bottleneck: suddenly a large number of long-lived objects are created, causing high-latency new generation collections because these objects are not only copied out of the eden space but must also bounce back and forth in the survivor spaces until they are promoted. This "sudden" arrival of a large number and net size of long-lived objects can cause substantial pauses due to new generation collection.

Concurrent mark/sweep old generation fragmentation

The other, more serious failure mode of concurrent mark/sweep is fragmentation of the old generation. When this strikes, you'll see a "promotion failed" followed by a full (compacting) GC in the GC log which can easily take 10s of seconds on a large heap. I wasn't able to provoke this in my limited testing (maybe I didn't run the tests for long enough), but for a long-running application it's inevitable that eventually you'll hit promotion failed. It's a very real issue, and it comes up frequently on the Solr user's list; here's a recent example.

Azul has a nifty free tool, Fragger, that easily provokes fragmentation and full collection, while only using a small (10% by default) part of the heap and allocating at a light (20 MB/sec default) rate. It "wraps" around any other java command-line, and accelerates testing since it fragments the old gen faster than your application normally would. I ran a quick test, adding Fragger to the search test, and sure enough concurrent mark/sweep hit horrible pauses while Zing was unaffected.

In summary, I remain impressed with Zing, and I wish its C4 collector were the default for Java! Then we all would stop having any GC worries and could freely use Java with very large heaps, fundamentally changing how we build software in this age of very cheap RAM. Zing works fine at its defaults, while concurrent mark/sweep requires substantial tuning, and even then will eventually hit very high pause times due to the new generation bottleneck and old generation fragmentation.

By the way, a very useful JVM option is -XX:+PrintGCApplicationStoppedTime. This prints the net time of every stop-the-world event, like this:
  Total time for which application threads were stopped: 0.4178300 seconds
You can use this to see which stop-the-world phase of your garbage collector is causing your application's hangs.

Friday, September 28, 2012

Lucene's new analyzing suggester

Live suggestions as you type into a search box, sometimes called suggest or autocomplete, are now a standard, essential search feature, ever since Google set a high bar after going live just over four years ago.

In Lucene we have several different suggest implementations, under the suggest module; today I'm describing the new AnalyzingSuggester (to be committed soon; it should be available in 4.1).

To use it, you provide the set of suggest targets, which is the full set of strings and weights that may be suggested. The targets can come from anywhere; typically you'd process your query logs to create the targets, giving a higher weight to those queries that appear more frequently. If you sell movies you might use all movie titles with a weight according to sales popularity.

You also provide an analyzer, which is used to process each target into analyzed form. Under the hood, the analyzed form is indexed into an FST. At lookup time, the incoming query is processed by the same analyzer and the FST is searched for all completions sharing the analyzed form as a prefix.
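
Here is a rough sketch of the intended usage (4.1-era API, approximate); buildTargets() is a hypothetical helper returning your TermFreqIterator over (suggestion text, weight) pairs, for example derived from query logs:

  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
  AnalyzingSuggester suggester = new AnalyzingSuggester(analyzer);
  suggester.build(buildTargets());   // analyzes each target and indexes it into an FST

  // as the user types, look up the top 5 completions for the typed prefix:
  List<Lookup.LookupResult> results = suggester.lookup("ame", false, 5);
  for (Lookup.LookupResult result : results) {
    System.out.println(result.key + " (weight=" + result.value + ")");
  }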

Even though the matching is performed on the analyzed form, what's suggested is the original target (i.e., the unanalyzed input). Because Lucene has such a rich set of analyzer components, this can be used to create some useful suggesters:
  • With an analyzer that folds or normalizes case, accents, etc. (e.g., using ICUFoldingFilter), the suggestions will match irrespective of case and accents. For example, the query "ame..." would suggest Amélie.

  • With an analyzer that removes stopwords and normalizes case, the query "ghost..." would suggest "The Ghost of Christmas Past".

  • Even graph TokenStreams, such as SynonymFilter, will work: in such cases we enumerate and index all analyzed paths into the FST. If the analyzer recognizes "wifi" and "wireless network" as synonyms, and you have the suggest target "wifi router" then the user query "wire..." would suggest "wifi router".

  • Japanese suggesters may now be possible, with an analyzer that copies the reading (ReadingAttribute in the Kuromoji analyzer) as its output.

Given the diversity of analyzers, and the easy extensibility for applications to create their own analyzers, I'm sure there are many interesting use cases for this new AnalyzingSuggester: if you have an example please share with us on Lucene's user list (java-user@lucene.apache.org).

While this is a great step forward, there's still plenty to do with Lucene's suggesters. We need to allow for fuzzy matching on the query so we're more robust to typos (there's a rough prototype patch on LUCENE-3846). We need to predict based on only part of the query, instead of insisting on a full prefix match. There are a number of interesting elements to Google's autosuggest that we could draw inspiration from. As always, patches welcome!

Tuesday, August 21, 2012

Lucene's new BlockPostingsFormat, thanks to Google Summer of Code

Another summer wraps up, another successful Google Summer of Code project finished in Apache Lucene!

We had two projects this summer; here I'll describe Han Jiang's project (I was his mentor). The second project, separating StorableField and IndexableField, is also very exciting and wrapping up now.

The goal of Han's project was to create a compelling replacement int-block postings format, improving on the aging variable-int postings format that Lucene has used forever. The project was definitely a success! This morning I merged over the feature branch we had been using all summer, committing to both Lucene's trunk (future 5.x) and Lucene 4.x branch (to be 4.0 GA). This means Lucene 4.0 GA will have the new BlockPostingsFormat available.

The new postings format shows incredible search performance gains versus the current default postings format. I tested with luceneutil on the full May 2012 English Wikipedia index (33.3 million documents, 15 GB index):

  Task               QPS base   StdDev base   QPS packed   StdDev packed      % change
  AndHighLow           353.09          9.16       317.06            2.53    -13% -  -7%
  Fuzzy2                53.08          1.02        49.58            3.41    -14% -   1%
  Respell               41.23          0.56        39.67            1.37     -8% -   0%
  PKLookup             169.41          1.94       166.69            1.57     -3% -   0%
  Fuzzy1                33.07          0.82        35.24            0.93      1% -  12%
  IntNRQ                 3.82          0.01         4.09            0.28      0% -  14%
  MedSpanNear            1.46          0.03         1.60            0.03      5% -  13%
  HighSpanNear           0.55          0.01         0.62            0.01      9% -  17%
  LowSpanNear            2.65          0.06         3.08            0.03     12% -  19%
  HighSloppyPhrase       0.63          0.02         0.74            0.02      9% -  23%
  OrHighHigh             3.04          0.17         3.55            0.35      0% -  35%
  HighPhrase             0.65          0.00         0.76            0.04      9% -  24%
  OrHighMed              6.62          0.39         7.75            0.80      0% -  37%
  LowSloppyPhrase        2.34          0.04         2.78            0.05     14% -  23%
  Wildcard              17.20          0.20        20.42            0.76     12% -  24%
  OrHighLow              7.29          0.44         8.67            0.92      0% -  39%
  AndHighMed            23.11          0.61        27.55            0.34     14% -  23%
  LowPhrase              7.44          0.10         9.01            0.44     13% -  28%
  Prefix3               24.65          0.35        29.94            1.23     14% -  28%
  MedPhrase              3.47          0.04         4.24            0.18     15% -  28%
  MedSloppyPhrase        2.03          0.04         2.51            0.04     19% -  28%
  LowTerm              145.81          0.36       182.14            2.02     23% -  26%
  MedTerm               51.64          0.25        65.57            1.14     24% -  29%
  HighTerm               8.49          0.03        10.94            0.23     25% -  32%
  AndHighHigh            6.52          0.19         8.70            0.19     26% -  40%


The gains are awesome! After this change has baked for a while I expect we'll make it the default postings format, maybe in time for 4.1.

This effort has been a long time coming: ever since the creation of flexible indexing, allowing for full pluggability of all files in the index, we have wanted to improve on Lucene's postings format. Early int-block prototypes showed great performance.

The new format stores postings data in fixed blocks of 128 ints. Each block is encoded using a simple packed ints format. The .doc file interleaves docID delta and term frequency blocks, and also includes skip data, while the .pos file holds position deltas. Offsets and payloads, if present, are stored in the .pay file. On the full English Wikipedia index used in the above test this resulted in a nice reduction of disk usage: 13.7 GB (new) vs 14.7 GB (current). That's ~7% smaller.

The format only uses packed int blocks for high-frequency terms: low frequency terms (that appear in less than 128 documents), and the final partial-block of a high frequency term, are stored in the same variable-int format as today's default. Skip points are stored at the start of each block (except the first block).

Han built and tested a number of fun ideas, including skipping within blocks (with partial block decoding), using PForDelta instead of simple packed ints, wrapping with PulsingPostingsFormat, all sorts of different ways to decode packed ints, different values for blockSize and skipMultiplier, encoding all-single-values very compactly, etc. We spent lots of electricity running all sorts of benchmarks. Some ideas worked and some didn't, and things did not go according to the original plan! Such is the healthy unpredictable, zig-zag, iterative nature of open-source development/exploration, yet in the end we find ourselves with a great new postings format.

Thank you Han Jiang for such an important improvement to Lucene, and thank you Google for sponsoring this effort through Google Summer of Code.

Tuesday, July 31, 2012

Lucene index in RAM with Azul's Zing JVM

Google's entire index has been in RAM for at least 5 years now. Why not do the same with an Apache Lucene search index?

RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.

The obvious approach is to load the index into Lucene's RAMDirectory, right?

Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a List of byte[1024] fragments (there are open Jira issues to address this but they haven't been committed yet). It also has unnecessary synchronization. If the application is updating the index (not just searching), another challenge is how to persist ongoing changes from RAMDirectory back to disk. Startup is much slower as the index must first be loaded into RAM. Given these problems, Lucene developers generally recommend using RAMDirectory only for small indices or for testing purposes, and otherwise trusting the operating system to manage RAM by using MMapDirectory (see Uwe's excellent post for more details).

While there are open issues to improve RAMDirectory (LUCENE-4123 and LUCENE-3659), they haven't been committed and many users simply use RAMDirectory anyway.

Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of RAMDirectory should not be a problem for Zing. Let's test it! But first, a quick digression on the importance of measuring search response time of all requests.

Search response time percentiles

Normally our performance testing scripts (luceneutil) measure the average response time, discarding outliers. We do this because we are typically testing an algorithmic change and want to see the effect of that change, ignoring confounding effects due to unrelated pauses from the OS, IO systems, GC, etc.

But for a real search application what matters is the total response time of every search request. Search is a fundamentally interactive process: the user sits and waits for the results to come back and then iterates from there. If even 1% of the searches take too long to respond that's a serious problem! Users are impatient and will quickly move on to your competitors.

So I made some improvements to luceneutil, including separating out a load testing client (sendTasks.py) that records the response time of all queries, as well as scripts (loadGraph.py and responseTimeGraph.py) to generate the resulting response-time graphs, and a top-level responseTimeTests.py to run a series of tests at increasing loads (queries/sec), automatically stopping once the load is clearly beyond the server's total capacity. As a nice side effect, this also determines the true capacity (max throughput) of the server rather than requiring an approximate measure by extrapolating from the average search latency.

Queries are sent according to a Poisson distribution, to better model the arrival times of real searches, and the client is thread-less so that if you are testing at 200 queries/sec and the server suddenly pauses for 5 seconds then there will be 1000 queries queued up once it wakes up again (this fixes an unfortunately common bug in load testers that dedicate one thread per simulated client).
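
Generating such arrival times is simple: draw exponentially distributed gaps between queries. A tiny sketch of the idea (shown in Java here, though the actual client is Python):

  // sample the next 10 send times for a Poisson arrival process at targetQPS
  Random random = new Random(17);
  double targetQPS = 200.0;
  double sendTimeSec = 0.0;
  for (int i = 0; i < 10; i++) {
    // inter-arrival gaps of a Poisson process are exponentially distributed
    sendTimeSec += -Math.log(1.0 - random.nextDouble()) / targetQPS;
    System.out.printf("query %d at t=%.4f sec%n", i, sendTimeSec);
  }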

The client can run (via password-less ssh) on a separate machine; this is important because if the server machine itself (not just the JVM) is experiencing system-wide pauses (e.g. due to heavy swapping) it can cause pauses in the load testing client which will skew the results. Ideally the client runs on an otherwise idle machine and experiences no pauses. The client even disables Python's cyclic garbage collector to further reduce the chance of pauses.

Wikipedia tests

To test Zing, I first indexed the full Wikipedia English database (as of 5/2/2012), totalling 28.78 GB plain text across 33.3 M 1 KB sized documents, including stored fields and term vectors, so I could highlight the hits using FastVectorHighlighter. The resulting index was 78 GB. For each test, the server loads the entire index into RAMDirectory and then the client sends the top 500 hardest (highest document frequency) TermQuerys, including stop words, to the server. At the same time, the server re-indexes documents (updateDocument) at the rate of 100/sec (~100 KB/sec), and reopens the searcher once per second.

Each test ran for an hour, discarding results for the first 5 minutes to allow for warmup. Max heap is 140 GB (-Xmx 140G). I also tested MMapDirectory, with max heap of 4 GB, as a baseline. The machine has 32 cores (64 with hyper-threading) and 512 GB of RAM, and the server ran with 40 threads.

I tested at varying load rates (queries/sec), from 50 (very easy) to 275 (too hard), and then plotted the resulting response time percentiles for different configurations. The default Oracle GC (Parallel) was clearly horribly slow (10s of seconds collection time) so I didn't include it. The experimental garbage first (G1) collector was even slower starting up (took 6 hours to load the index into RAMDirectory, vs 900 seconds for Concurrent Mark/Sweep (CMS)), and then the queries hit > 100 second latencies, so I also left it out (this was surprising as G1 is targeted towards large heaps). The three configurations I did test were CMS at its defaults settings, with MMapDirectory as the baseline, and both CMS and Zing with RAMDirectory.

At the lightest load (50 QPS), Zing does a good job maintaining low worst-case response time, while CMS shows long worst case response times, even with MMapDirectory:



To see the net capacity of each configuration, I plotted the 99th percentile response time across different load rates:



From this it's clear that the peak throughput for CMS + MMap was somewhere between 100 and 150 queries/sec, while the RAMDirectory based indices were somewhere between 225 and 250 queries/second. This is an impressive performance gain! It's also interesting because in separately testing RAMDirectory vs MMapDirectory I usually only see minor gains when measuring average query latency.

Plotting the same graph, without CMS + MMapDirectory and removing the 275 queries/second point (since it's over-capacity):



Zing remains incredibly flat at the 99th percentile, while CMS has high response times already at 100 QPS. At 225 queries/sec load, the highest near-capacity rate, for just CMS and Zing on RAMDirectory:



The pause times for CMS are worse than they were at 50 QPS: already at the 95th percentile the response times are too slow (4479 milliseconds).

Zing works!

It's clear from these tests that Zing really has a very low-pause garbage collector, even at high loads, while managing a 140 GB max heap with 78 GB Lucene index loaded into RAMDirectory. Furthermore, applications can expect a substantial increase in max throughput (around 2X faster in this case) and need not fear using RAMDirectory even for large indices, if they can run under Zing.

Note that Azul has just made the Zing JVM freely available to open-source developers, so now we all can run our own experiments, add Zing into the JVM rotation for our builds, etc.

Next I will test the new DirectPostingsFormat which holds all postings in RAM in simple arrays (no compression) for fast search performance. It requires even more RAM than this test, but gives even faster search performance!

Thank you to Azul for providing a beefy server machine and the Zing JVM to run these tests!

Sunday, July 29, 2012

Building a new Lucene postings format

As of 4.0 Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene's index formats.

A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.

The trickiest format to create is PostingsFormat, which provides read/write access to all postings (fields, terms, documents, frequencies, positions, offsets, payloads). Part of the challenge is that it has a large API surface area. But there are also complexities such as skipping, reuse, conditional use of different values in the enumeration (frequencies, positions, payloads, offsets), partial consumption of the enumeration, etc. These challenges unfortunately make it easy for bugs to sneak in, but an awesome way to ferret out all the bugs is to leverage Lucene's extensive randomized tests: run all tests with -Dtests.postingsformat=XXX (be sure to first register your new postings format). If your new postings format has a bug, tests will most likely fail.

However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others. My favorite is the innocently named TestBasics! Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net our tests are not a great fit for the early iterations when developing a new postings format.

I recently created a new postings format, BlockPostingsFormat, which will hopefully be more efficient than the Sep codec at using fixed int block encodings. I did this to support Han Jiang's Google Summer of Code project to add a useful int block postings format to Lucene.

So, I took the opportunity to address this problem of easier early-stage iterations while developing a new postings format by creating a new test, TestPostingsFormat. It has layers of testing (documents, +freqs, +positions, +payloads, +offsets) that you can incrementally enable as you iterate, as well as different test options (skipping or not, reuse or not, stop visiting documents and/or positions early, one or more threads, etc.). When you turn on verbose (-Dtests.verbose=true) the test prints clear details of everything it indexed and what exactly it's testing so a failure is easy to debug. I'm very happy with the results: I found this to be a much more productive way to create a new postings format.

The goal of this test is to be so thorough that if it passes with your posting format then all Lucene's tests should pass. If ever we find that's not the case then I consider that a bug in TestPostingsFormat! (Who tests the tester?)

If you find yourself creating a new postings format I strongly suggest using the new TestPostingsFormat during early development to get your postings format off the ground. Once it's passing, run all tests with your new postings format, and if something fails please let us know so we can fix TestPostingsFormat.

Tuesday, July 3, 2012

Lucene 4.0.0 alpha, at long last!

The 4.0.0 alpha release of Lucene and Solr is finally out!

This is a major release with lots of great changes. Here I briefly describe the most important Lucene changes, but first the basics:
  • All deprecated APIs as of 3.6.0 have been removed.

  • Pre-3.0 indices are no longer supported.

  • MIGRATE.txt describes how to update your application code.

  • The index format won't change (unless a serious bug fix requires it) between this release and 4.0 GA, but APIs may still change before 4.0.0 beta.

Please try the release and report back!

Pluggable Codec

The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis.
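
For example, here is a rough sketch (assuming the 4.0 APIs) of choosing a different postings format for just the primary-key field by overriding the default codec:

  // use the in-RAM "Memory" postings format for the id field, the default for the rest
  Codec codec = new Lucene40Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("id".equals(field)) {
        return PostingsFormat.forName("Memory");
      } else {
        return PostingsFormat.forName("Lucene40");
      }
    }
  };

  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
  iwc.setCodec(codec);
  IndexWriter writer = new IndexWriter(dir, iwc);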

There are some fun core codecs:
  • Lucene40 is the default codec.

  • Lucene3x (read-only) reads any index written with Lucene 3.x.

  • SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!).

  • MemoryPostingsFormat stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (primary key (id) field, date field, etc.)

  • PulsingPostingsFormat inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup.

  • AppendingCodec avoids seeking while writing, necessary for file-systems such as Hadoop DFS.

If you create your own Codec it's easy to confirm all of Lucene/Solr's tests pass with it. If tests fail then likely your Codec has a bug!

A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API.

Flexible scoring

Lucene's scoring is now fully pluggable, with the TF/IDF vector space model remaining as the default. You can create your own scoring model, or use one of the core scoring models (BM25, Divergence from Randomness, Language Models, and Information-based models). Per-document normalization values are no longer limited to a single byte. Various new aggregate statistics are now available.
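
Switching models is nearly a one-liner; a small sketch using BM25 at both index and search time (directory and analyzer setup assumed):

  // use BM25 instead of the default TF/IDF similarity, at index time (norms) and search time
  Similarity similarity = new BM25Similarity();

  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
  iwc.setSimilarity(similarity);
  IndexWriter writer = new IndexWriter(dir, iwc);

  IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
  searcher.setSimilarity(similarity);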

These changes were part of a 2011 Google Summer of Code project (thank you David!).

These two changes are really important because they remove the barriers to ongoing innovations. Now it's easy to experiment with wild changes to the index format or to Lucene's scoring models. A recent example of such innovation is this neat codec by the devs at Flax to enable updatable fields by storing postings in a Redis key/value store.

Document Values

The new document values API stores strongly typed single-valued fields per document, meant as an eventual replacement for Lucene's field cache. The values are pre-computed during indexing and stored in the index in a column-stride format (values for a single field across all documents are stored together), making it much faster to initialize at search time than the field cache. Values can be fixed 8, 16, 32, 64 bit ints, or variable-bits sized (packed) ints; float or double; and six flavors of byte[] (fixed size or variable sized; dereferenced, straight or sorted).

New Field APIs

The API for creating document fields has changed: Fieldable and AbstractField have been removed, and a new FieldType, factored out of Field class, holds details about how the field's value should be indexed. New classes have been created for specific commonly-used fields:
  • StringField indexes a string as a single token, without norms and as docs only. For example, use this for a primary key (id) field, or for a field you will sort on.

  • TextField indexes the fully tokenized string, with norms and including docs, term frequencies and positions. For example, use this for the primary text field.

  • StoredField is a field whose value is just stored.

  • XXXDocValuesField create typed document values fields.

  • IntField, FloatField, LongField, DoubleField create typed numeric fields for efficient range queries and filters.

If none of these field classes apply, you can always create your own FieldType (typically by starting from the exposed FieldTypes from the above classes and then tweaking), and then construct a Field by passing the name, FieldType and value.
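
A small sketch showing the common cases (the field names and values are made up):

  Document doc = new Document();
  doc.add(new StringField("id", "42", Field.Store.YES));                    // single token, docs only
  doc.add(new TextField("body", "the quick brown fox", Field.Store.NO));    // fully tokenized, freqs + positions
  doc.add(new StoredField("thumbnailUrl", "http://example.com/42.png"));    // stored, not indexed
  doc.add(new LongField("timestamp", System.currentTimeMillis(), Field.Store.NO)); // numeric range queries
  writer.addDocument(doc);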

Note that the old APIs (using Index, Store, TermVector enums) are still present (deprecated), to ease migration.

These changes were part of a 2011 Google Summer of Code project (thank you Nikola!).

Other big changes

Lucene's terms are now binary (arbitrary byte[]); by default they are UTF-8 encoded strings, sorted in Unicode sort order. But your Analyzer is free to produce tokens with an arbitrary byte[] (e.g., CollationKeyAnalyzer does so).

A new DirectSpellChecker finds suggestions directly from any Lucene index, avoiding the hassle of maintaining a sidecar spellchecker index. It uses the same fast Levenshtein automata as FuzzyQuery (see below).

Term offsets (the start and end character position of each term) may now be stored in the postings, by using FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS when indexing the field. I expect this will be useful for fast highlighting without requiring term vectors, but this part is not yet done (patches welcome!).

A new AutomatonQuery matches all documents containing any term matching a provided automaton. Both WildcardQuery and RegexpQuery simply construct the corresponding automaton and then run AutomatonQuery. The classic QueryParser produces a RegexpQuery if you type fieldName:/expression/, or /expression/ to search against the default field.
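
For example (a small sketch; the field name and patterns are made up):

  // both of these compile their pattern into an automaton and run it against the terms dictionary
  Query wildcard = new WildcardQuery(new Term("title", "lucen*"));
  Query regexp = new RegexpQuery(new Term("title", "luc[ei]ne"));

  // the classic QueryParser accepts the same regexp syntax inline
  QueryParser parser = new QueryParser(Version.LUCENE_40, "title", analyzer);
  Query parsed = parser.parse("title:/luc[ei]ne/");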

Optimizations

Beyond the fun new features there are some incredible performance gains.

If you use FuzzyQuery, you should see a factor of 100-200 speedup on moderately sized indices.

If you search with a Filter, you can see gains up to 3X faster (depending on filter density and query complexity), thanks to a change that applies filters just like we apply deleted documents.

If you use multiple threads for indexing, you should see stunning throughput gains (265% faster in one test), thanks to concurrent flushing. You are also now able to use more than 2048 MB of IndexWriter RAM buffer (as long as you use multiple threads).

The new BlockTree default terms dictionary uses far less RAM to hold the terms index, and can sometimes avoid going to disk for terms that do not exist. In addition, the field cache also uses substantially less RAM, by avoiding separate objects per document and instead packing character data into shared byte[] blocks. Together this results in a 73% reduction in RAM required for searching in one case.

IndexWriter now buffers term data using byte[] instead of char[], using half the RAM for ASCII terms.

MultiTermQuery now rewrites per-segment, and caches per-term metadata to avoid a second lookup during scoring. This should improve performance though it hasn't been directly tested.

If a BooleanQuery consists of only MUST TermQuery clauses, then a specialized ConjunctionTermScorer is used, giving ~25% speedup.

Reducing merge IO impact

Merging (consolidating many small segments into a single big one) is a very IO and CPU intensive operation which can easily interfere with ongoing searches. In 4.0.0 we now have two ways to reduce this impact:
  • Rate-limit the IO caused by ongoing merging, by calling FSDirectory.setMaxMergeWriteMBPerSec.

  • Use the new NativeUnixDirectory which bypasses the OS's IO cache for all merge IO, by using direct IO. This ensures that a merge won't evict hot pages used by searches. (Note that there is also a native WindowsDirectory, but it does not yet use direct IO during merging... patches welcome!).

Remember to also set swappiness to 0 on Linux if you want to maximize search responsiveness.

More generally, the APIs that open an input or output file (Directory.openInput and Directory.createOutput) now take an IOContext describing what's being done (e.g., flush vs merge), so you can create a custom Directory that changes its behavior depending on the context.

These changes were part of a 2011 Google Summer of Code project (thank you Varun!).

Consolidated modules

The diverse sources, previously scattered between Lucene's and Solr's core and contrib, have been consolidated. Especially noteworthy is the analysis module, providing a rich selection of 48 analyzers across many languages; the queries module, containing function queries (the old core function queries have been removed) and other non-core query classes; and the queryparser module, with numerous query parsers including the classic QueryParser (moved from core).

Other changes

Here's a long list of additional changes:
  • The classic QueryParser now interprets term~N where N is an integer >= 1 as a FuzzyQuery with edit distance N.

  • The field cache normally requires single-valued fields, but we've added FieldCache.getDocTermsOrd which can handle multi-valued fields.

  • Analyzers must always provide a reusable token stream, by implementing the Analyzer.createComponents method (reusableTokenStream has been removed and tokenStream is now final, in Analyzer).

  • IndexReaders are now read-only (you cannot delete documents by id, nor change norms) and are strongly typed as AtomicReader or CompositeReader.

  • The API for reading term vectors is the same API used for reading all postings, except the term vector API only covers a single document. This is a good match because term vectors are really just a single-document inverted index.

  • Positional queries (PhraseQuery, SpanQuery) will now throw an exception if you run them against a field that did not index positions (previously they silently returned 0 hits).

  • String-based field cache APIs have been replaced with BytesRef based APIs.

  • ParallelMultiSearcher has been absorbed into IndexSearcher as an optional ExecutorService argument to the constructor. Searcher and Searchable have been removed.

  • All serialization code has been removed from Lucene's classes; you must handle serialization at a higher level in your application.

  • Field names are no longer interned, so you cannot rely on == to test for equality (use .equals instead).

  • *SpanFilter has been removed: they created too many objects during searching and were not scalable.

  • Removed IndexSearcher.close: IndexSearcher now only takes a provided IndexReader (no longer a Directory), which is the caller's responsibility to close.

  • You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter.

  • FieldSelector (to only load certain stored fields) has been replaced with a simpler StoredFieldVisitor API.

Happy searching!

Monday, May 14, 2012

Finite State Automata in Lucene

Lucene Revolution 2012 is now done, and the talk Robert and I gave went well! We showed how we are using automata (FSAs and FSTs) to make great improvements throughout Lucene.

You can view the slides here.

This was the first time I used Google Docs exclusively for a talk, and I was impressed! The real-time collaboration was awesome: we each could see the edits the other was doing, live. You never have to "save" your document: instead, every time you make a change, the document is saved to a new revision and you can then use infinite undo, or step back through all revisions, to go back.

Finally, Google Docs covers the whole life-cycle of your talk: editing/iterating, presenting (it presents in full-screen just fine, but does require an internet connection; I exported to PDF ahead of time as a backup) and, finally, sharing with the rest of the world!