Sunday, December 30, 2012

A new Lucene highlighter is born

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it's the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google's Chrome browser has an ingenious solution to the final inch problem, when you use "Find..." to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn't store the full token graph.

Unlike the existing highlighters, which rely on term-vectors or on re-analysis of each matched document to obtain the per-token offsets, PostingsHighlighter uses the recently added postings offsets feature. To index postings offsets you must set the field to be highlighted to use FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS option during indexing.

It turns out postings offsets is much more efficient storage for offsets because the default codec (currently Lucene41) does a good job compressing them: ~1.1 byte per position, which includes both start and end offset. In contrast, term vectors require substantially more disk space (~7.8X for the 10 million document English Wikipedia index), slow down indexing and merging, and are slow to access at search time. A smaller index also means the "working set" size, i.e. the net number of bytes that your search application frequently hits from disk, will be smaller, so you'll need less RAM to keep the index hot.

PostingsHighlighter uses a BreakIterator to find passages in the text; by default it breaks using getSentenceIterator. It then iterates in parallel (merge sorting by offset) through the positions of all terms from the query, coalescing those hits that occur in a single passage into a Passage, and then scores each Passage using a separate PassageScorer.

The scoring model is fun: it treats the single original document as the whole corpus, and then scores individual passages as if they were documents in this corpus. The default PassageScorer uses BM25 scoring, biased with a normalization factor that favors passages occurring closer to the start of the document, but it's pluggable so you can implement your own scoring (and feel free to share if you find an improvement!).

This new highlighter should be substantially faster than our existing highlighters on a cold index (when the index doesn't fit entirely into available RAM), as it does more sequential IO instead of seek-heavy random access. Furthermore, as you increase the number of top hits, the performance gains should be even better. Also, the larger the documents the better the performance gains should be.

One known limitation is that it can only highlight a single field at a time, i.e. you cannot pass it N fields and have it pick the best passages across all of them, though both existing highlighters have the same limitation. The code is very new and may still have some exciting bugs! This is why it's located under Lucene's sandbox module.

If you are serious about highlighting in your search application (and you should be!) then PostingsHighlighter is well worth a look!

Sunday, December 9, 2012

Fun with Lucene's faceted search module

These days faceted search and navigation is common and users have come to expect and rely upon it.

Lucene's facet module, first appearing in the 3.4.0 release, offers a powerful implementation, making it trivial to add a faceted user interface to your search application. Shai Erera wrote up a nice overview here and worked through nice "getting started" examples in his second post.

The facet module has not been integrated into Solr, which has an entirely different implementation, nor into ElasticSearch, which also has its own entirely different implementation. Bobo is yet another facet implementation! I'm sure there are more...

The facet module can compute the usual counts for each facet, but also has advanced features such as aggregates other than hit count, sampling (for better performance when there are many hits) and complements aggregation (for better performance when the number of hits is more than half of the index). All facets are hierarchical, so the app is free to index an arbitrary tree structure for each document. With the upcoming 4.1, the facet module will fully support near-real-time (NRT) search.

Lucene's nightly performance benchmarks

I was curious about the performance of faceted search, so I added date facets, indexed as year/month/day hierarchy, to the nightly Lucene benchmarks. Specifically I added faceting to all TermQuerys that were already tested, and now we can watch this graph to track our faceted search performance over time. The date field is the timestamp of the most recent revision of each Wikipedia page.

Simple performance tests

I also ran some simple initial tests on a recent (5/2/2012) English Wikipedia export, which contains 30.2 GB of plain text across 33.3 million documents. By default, faceted search retrieves the counts of all facet values under the root node (years, in this case):
     Date (3994646)
       2012 (1990192)
       2011 (752327)
       2010 (380977)
       2009 (275152)
       2008 (271543)
       2007 (211688)
       2006 (98809)
       2005 (12846)
       2004 (1105)
       2003 (7)
It's interesting that 2012 has such a high count, even though this export only includes the first five months and two days of 2012. Wikipedia's pages are very actively edited!

The search index with facets grew only slightly (~2.3%, from 12.5 GB to 12.8 GB) because of the additional indexed facet field. The taxonomy index, which is a separate index used to map facets to fixed integer codes, was tiny: only 120 KB. The more unique facet values you have, the larger this index will be.

Next I compared search performance with and without faceting. A simple TermQuery (party), matching just over a million hits, was 51.2 queries per second (QPS) without facets and 3.4 QPS with facets. While this is a somewhat scary slowdown, it's the worst case scenario: TermQuery is very cheap to execute, and can easily match a large number of hits. The cost of faceting is in proportion to the number of hits. It would be nice to speed this up (patches welcome!).

I also tested a harder PhraseQuery ("the village"), matching 194 K hits: 3.8 QPS without facets and 2.8 QPS with facets, which is less of a hit because PhraseQuery takes more work to match each hit and generally matches fewer hits.

Loading facet data in RAM

For the above results I used the facet defaults, where the per-document facet values are left on disk during aggregation. If you have enough RAM you can also load all facet values into RAM using the CategoryListCache class. I tested this, and it gave nice speedups: the TermQuery was 73% faster (to 6.0 QPS) and the PhraseQuery was 19% faster.

However, there are downsides: it's time-consuming to initialize (4.6 seconds in my test), and not NRT-friendly, though this shouldn't be so hard to fix (patches welcome!). It also required a substantial 1.9 GB RAM, according to Lucene's RamUsageEstimator. We should be able to reduce this RAM usage by switching to Lucene's fast packed ints implementation from the current int[][] it uses today, or by using DocValues to hold the per-document facet data. I just opened LUCENE-4602 to explore DocValues and initial results look very promising.


Next I tried sampling, where the facet module visits 1% of the hits (by default) and only aggregates counts for those. In the default mode, this sampling is used only to find the top N facet values, and then a second pass computes the correct count for each of those values. This is a good fit when the taxonomy is wide and flat, and counts are pretty evenly distributed. I tested that, but results were slower, because the date taxonomy is not wide and flat and has rather lopsided counts (2012 has the majority of hits).

You can also skip the second pass and then present approximate counts or a percentage value to the user. I tested that and saw sizable gains: the TermQuery was 248% (2.5X) faster (to 12.2 QPS) and the PhraseQuery was 29% faster (to 3.6 QPS). The sampling is also quite configurable: you can set the min and max sample sizes, the sample ratio, the threshold under which no sampling should happen, etc.

Lucene's facet module makes it trivial to add facets to your search application, and offers useful features like sampling, alternative aggregates, complements, RAM caching, and fully customizable interfaces for many aspects of faceting. I'm hopeful we can reduce the RAM consumption for caching, and speed up the overall performance, over time.

Monday, November 19, 2012

Lucene with Zing, Part 2

When I last tested Lucene with the Zing JVM the results were impressive: Zing's fully concurrent C4 garbage collector had very low pause times with the full English Wikipedia index (78 GB) loaded into RAMDirectory, which is not an easy feat since we know RAMDirectory is stressful for the garbage collector.

I had used Lucene 4.0.0 alpha for that first test, so I decided to re-test using Lucene's 4.0.0 GA release and, surprisingly, the results changed! MMapDirectory's max throughput was now better than RAMDirectory's (versus being much lower before), and the concurrent mark/sweep collector (-XX:-UseConcMarkSweepGC) was no longer hitting long GC pauses.

This was very interesting! What change could improve MMapDirectory's performance, and lower the pressure on concurrent mark/sweep's GC to the point where pause times were so much lower in GA compared to alpha?

Fortunately, Zing includes a nice tool, gcLogAnalyser, to visualize all sorts of data extracted from its GC logs. Like Zing's JVM, this tool is also free for open-source development. To enable GC logging you should pass -verbose:gc and -XX:+PrintGCDetails to the Zing JVM, and it's also a good idea to redirect all GC output to a separate file using -Xloggc so that any other output isn't mixed in.

Using gcLogAnalyser it was clear that, for my test, Lucene 4.0.0 alpha had a much higher allocation rate compared to 4.0.0 GA, and from that hint I isolated the difference to this sneaky little performance bug (LUCENE-4289). That issue, present in 4.0.0 alpha and now fixed in 4.0.0 GA, didn't break anything but was causing FastVectorHighlighter to do far too many term lookups internally, consuming too much CPU, slowing down the max throughput and generating extra garbage collection pressure in the process.

It was great that Zing handled the higher allocation rate just fine while concurrent mark/sweep hit long pauses and I wanted to see if this same long pause behavior might happen again if I changed the test.

So I constructed a new test, using 4.0.0 GA, that would bring back some of the garbage collector stress by using term queries matching fewer hits (~170K hits, versus up to ~3M hits before). This would mean more time will be spent hiliting, which is somewhat GC intensive, and less time would be spent retrieving hits, which generates very little garbage. I also increased the near-real-time indexing rate from 100 KB/sec to 2 MB/sec, and ran on the same full English Wikipedia index, but with deleted documents included (net index was ~70% larger in bytes, but the same 33.3 million number of live documents). The net index size was 132G and max heap was 240G (the computer has 512G RAM, plenty!). Each test (one data point in the graphs below) ran for one hour, recording the response time of all queries. I discard the first 5 minute's worth of queries to allow for warmup.

I set concurrent mark/sweep's new generation to 1 GB (-XX:NewSize=1g) because if I didn't, it would strangely sometimes pick a tiny (~18 MB) new generation size which would kill its performance (~10X slower throughput).

Again I tried the experimental G1 collector and it had awful (~130 second) GC hangs, so I left it out.

The first surprising result was the max (saturated) throughput was nearly 2X faster with Zing than with concurrent mark/sweep using RAMDirectory; previously the two had nearly the same maximum throughput:

The graph plots the response time of the slowest query against the actual QPS achieved (total number of queries actually answered in the hour). It's also interesting that MMapDirectory gets higher max throughput than RAMDirectory with concurrent mark/sweep; this is likely because concurrent mark/sweep has higher overhead with a large heap. It could also be in my previous test that concurrent mark/sweep was picking too-small size for the new generation (I had left it at its defaults before).

Here's the same graph, but discarding the over-capacity data points so we can see worst case query latencies below the max throughput:

Again concurrent mark/sweep has non-trivial pauses with both MMapDirectory and RAMDirectory, even at low loads, while Zing remains remarkably flat. What's going on?

Concurrent mark/sweep new generation bottleneck

I suspect in this test the new generation collection is a bottleneck, since it's still a stop-the-world collection, and this explains concurrent mark/sweep's slower max throughput when compared to Zing on RAMDirectory. From the logs, I see the new gen collection running 3-4 times per second and typically taking ~125 msec but sometimes up to 400-500 msec.

Unfortunately, if your application has a high allocation rate, you're stuck between a rock and a hard place: if you tune the new generation size smaller, the throughput drops (this test is already much slower throughput than Zing), but if you tune it larger then you have longer stop-the-world pauses. Zing's C4 collector doesn't have this problem because even its new generation collector is fully concurrent, so it gains the efficiency of a larger new generation without suffering longer pause times.

Furthermore, I think Lucene's segment merges create an allocation pattern that exacerbates the new generation bottleneck: suddenly a large number of long-lived objects are created, causing high-latency new generation collections because these objects are not only copied out of the eden space but must also bounce back and forth in the survivor spaces until they are promoted. This "sudden" arrival of a large number and net size of long-lived objects can cause substantial pauses due to new generation collection.

Concurrent mark/sweep old generation fragmentation

The other, more serious failure mode of concurrent mark/sweep is fragmentation of the old generation. When this strikes, you'll see a "promotion failed" followed by a full (compacting) GC in the GC log which can easily take 10s of seconds on a large heap. I wasn't able to provoke this in my limited testing (maybe I didn't run the tests for long enough), but for a long-running application it's inevitable that eventually you'll hit promotion failed. It's a very real issue, and it comes up frequently on the Solr user's list; here's a recent example.

Azul has a nifty free tool, Fragger, that easily provokes fragmentation and full collection, while only using a small (10% by default) part of the heap and allocating at a light (20 MB/sec default) rate. It "wraps" around any other java command-line, and accelerates testing since it fragments the old gen faster than your application normally would. I ran a quick test, adding Fragger to the search test, and sure enough concurrent mark/sweep hit horrible pauses while Zing was unaffected.

In summary, I remain impressed with Zing, and I wish its C4 collector were the default for Java! Then we all would stop having any GC worries and could freely use Java with very large heaps, fundamentally changing how we build software in this age of very cheap RAM. Zing works fine at its defaults, while concurrent mark/sweep requires substantial tuning, and even then will eventually hit very high pause times due to the new generation bottleneck and old generation fragmentation.

By the way, a very useful JVM option is -XX:+PrintGCApplicationStoppedTime. This prints the net time of every stop-the-world event, like this:
  Total time for which application threads were stopped: 0.4178300 seconds
You can use this to see which stop-the-world phase of your garbage collector is causing your application's hangs.

Friday, September 28, 2012

Lucene's new analyzing suggester

Live suggestions as you type into a search box, sometimes called suggest or autocomplete, is now a standard, essential search feature ever since Google set a high bar after going live just over four years ago.

In Lucene we have several different suggest implementations, under the suggest module; today I'm describing the new AnalyzingSuggester (to be committed soon; it should be available in 4.1).

To use it, you provide the set of suggest targets, which is the full set of strings and weights that may be suggested. The targets can come from anywhere; typically you'd process your query logs to create the targets, giving a higher weight to those queries that appear more frequently. If you sell movies you might use all movie titles with a weight according to sales popularity.

You also provide an analyzer, which is used to process each target into analyzed form. Under the hood, the analyzed form is indexed into an FST. At lookup time, the incoming query is processed by the same analyzer and the FST is searched for all completions sharing the analyzed form as a prefix.

Even though the matching is performed on the analyzed form, what's suggested is the original target (i.e., the unanalyzed input). Because Lucene has such a rich set of analyzer components, this can be used to create some useful suggesters:
  • With an analyzer that folds or normalizes case, accents, etc. (e.g., using ICUFoldingFilter), the suggestions will match irrespective of case and accents. For example, the query "ame..." would suggest Amélie.

  • With an analyzer that removes stopwords and normalizes case, the query "ghost..." would suggest "The Ghost of Christmas Past".

  • Even graph TokenStreams, such as SynonymFilter, will work: in such cases we enumerate and index all analyzed paths into the FST. If the analyzer recognizes "wifi" and "wireless network" as synonyms, and you have the suggest target "wifi router" then the user query "wire..." would suggest "wifi router".

  • Japanese suggesters may now be possible, with an analyzer that copies the reading (ReadingAttribute in the Kuromoji analyzer) as its output.

Given the diversity of analyzers, and the easy extensibility for applications to create their own analyzers, I'm sure there are many interesting use cases for this new AnalyzingSuggester: if you have an example please share with us on Lucene's user list (

While this is a great step forward, there's still plenty to do with Lucene's suggesters. We need to allow for fuzzy matching on the query so we're more robust to typos (there's a rough prototype patch on LUCENE-3846). We need to predict based on only part of the query, instead of insisting on a full prefix match. There are a number of interesting elements to Google's autosuggest that we could draw inspiration from. As always, patches welcome!

Tuesday, August 21, 2012

Lucene's new BlockPostingsFormat, thanks to Google Summer of Code

Another summer wraps up, another successful Google Summer of Code project finished in Apache Lucene!

We had two projects this summer; here I'll describe Han Jiang's project (I was his mentor). The second project, separating StorableField and IndexableField, is also very exciting and wrapping up now.

The goal of Han's project was to create a compelling replacement int-block postings format, improving on the aging variable-int postings format that Lucene has used forever. The project was definitely a success! This morning I merged over the feature branch we had been using all summer, committing to both Lucene's trunk (future 5.x) and Lucene 4.x branch (to be 4.0 GA). This means Lucene 4.0 GA will have the new BlockPostingsFormat available.

The new postings format shows incredible search performance gains versus the current default postings format. I tested with luceneutil on the full May 2012 English Wikipedia index (33.3 million documents, 15 GB index):

TaskQPS baseStdDev baseQPS packedStdDev packed% change

The gains are awesome! After this change has baked for a while I expect we'll make it the default postings format, maybe in time for 4.1.

This effort has been a long time coming: ever since the creation of flexible indexing, allowing for full pluggability of all files in the index, we have wanted to improve on Lucene's postings format. Early int-block prototypes showed great performance.

The new format stores postings data in fixed blocks of 128 ints. Each block is encoded using a simple packed ints format. The .doc file interleaves docID delta and term frequency blocks, and also includes skip data, while the .pos file holds position deltas. Offsets and payloads, if present, are stored in the .pay file. On the full English Wikipedia index used in the above test this resulted in a nice reduction of disk usage: 13.7 GB (new) vs 14.7 GB (current). That's ~7% smaller.

The format only uses packed int blocks for high-frequency terms: low frequency terms (that appear in less than 128 documents), and the final partial-block of a high frequency term, are stored in the same variable-int format as today's default. Skip points are stored at the start of each block (except the first block).

Han built and tested a number of fun ideas, including skipping within blocks (with partial block decoding), using PForDelta instead of simple packed ints, wrapping with PulsingPostingsFormat, all sorts of different ways to decode packed ints, different values for blockSize and skipMultiplier, encoding all-single-values very compactly, etc. We spent lots of electricity running all sorts of benchmarks. Some ideas worked and some didn't, and things did not go according to the original plan! Such is the healthy unpredictable, zig-zag, iterative nature of open-source development/exploration, yet in the end we find ourselves with a great new postings format.

Thank you Han Jiang for such an important improvement to Lucene, and thank you Google for sponsoring this effort through Google Summer of Code.

Tuesday, July 31, 2012

Lucene index in RAM with Azul's Zing JVM

Google's entire index has been in RAM for at least 5 years now. Why not do the same with an Apache Lucene search index?

RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.

The obvious approach is to load the index into Lucene's RAMDirectory, right?

Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a List of byte[1024] fragments (there are open Jira issues to address this but they haven't been committed yet). It also has unnecessary synchronization. If the application is updating the index (not just searching), another challenge is how to persist ongoing changes from RAMDirectory back to disk. Startup is much slower as the index must first be loaded into RAM. Given these problems, Lucene developers generally recommend using RAMDirectory only for small indices or for testing purposes, and otherwise trusting the operating system to manage RAM by using MMapDirectory (see Uwe's excellent post for more details).

While there are open issues to improve RAMDirectory (LUCENE-4123 and LUCENE-3659), they haven't been committed and many users simply use RAMDirectory anyway.

Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of RAMDirectory should not be a problem for Zing. Let's test it! But first, a quick digression on the importance of measuring search response time of all requests.

Search response time percentiles

Normally our performance testing scripts (luceneutil) measure the average response time, discarding outliers. We do this because we are typically testing an algorithmic change and want to see the effect of that change, ignoring confounding effects due to unrelated pauses from the the OS, IO systems, or GC, etc.

But for a real search application what matters is the total response time of every search request. Search is a fundamentally interactive process: the user sits and waits for the results to come back and then iterates from there. If even 1% of the searches take too long to respond that's a serious problem! Users are impatient and will quickly move on to your competitors.

So I made some improvements to luceneutil, including separating out a load testing client ( that records the response time of all queries, as well as scripts ( and to generate the resulting response-time graphs, and a top-level to run a series of tests at increasing loads (queries/sec), automatically stopping once the load is clearly beyond the server's total capacity. As a nice side effect, this also determines the true capacity (max throughput) of the server rather than requiring an approximate measure by extrapolating from the average search latency.

Queries are sent according to a Poisson distribution, to better model the arrival times of real searches, and the client is thread-less so that if you are testing at 200 queries/sec and the server suddenly pauses for 5 seconds then there will be 1000 queries queued up once it wakes up again (this fixes an unfortunately common bug in load testers that dedicate one thread per simulated client).

The client can run (via password-less ssh) on a separate machine; this is important because if the server machine itself (not just the JVM) is experiencing system-wide pauses (e.g. due to heavy swapping) it can cause pauses in the load testing client which will skew the results. Ideally the client runs on an otherwise idle machine and experiences no pauses. The client even disables Python's cyclic garbage collector to further reduce the chance of pauses.

Wikipedia tests

To test Zing, I first indexed the full Wikipedia English database (as of 5/2/2012), totalling 28.78 GB plain text across 33.3 M 1 KB sized documents, including stored fields and term vectors, so I could highlight the hits using FastVectorHighlighter. The resulting index was 78 GB. For each test, the server loads the entire index into RAMDirectory and then the client sends the top 500 hardest (highest document frequency) TermQuerys, including stop words, to the server. At the same time, the server re-indexes documents (updateDocument) at the rate of 100/sec (~100 KB/sec), and reopens the searcher once per second.

Each test ran for an hour, discarding results for the first 5 minutes to allow for warmup. Max heap is 140 GB (-Xmx 140G). I also tested MMapDirectory, with max heap of 4 GB, as a baseline. The machine has 32 cores (64 with hyper-threading) and 512 GB of RAM, and the server ran with 40 threads.

I tested at varying load rates (queries/sec), from 50 (very easy) to 275 (too hard), and then plotted the resulting response time percentiles for different configurations. The default Oracle GC (Parallel) was clearly horribly slow (10s of seconds collection time) so I didn't include it. The experimental garbage first (G1) collector was even slower starting up (took 6 hours to load the index into RAMDirectory, vs 900 seconds for Concurrent Mark/Sweep (CMS)), and then the queries hit > 100 second latencies, so I also left it out (this was surprising as G1 is targeted towards large heaps). The three configurations I did test were CMS at its defaults settings, with MMapDirectory as the baseline, and both CMS and Zing with RAMDirectory.

At the lightest load (50 QPS), Zing does a good job maintaining low worst-case response time, while CMS shows long worst case response times, even with MMapDirectory:

To see the net capacity of each configuration, I plotted the 99% response time, across different load rates:

From this it's clear that the peak throughput for CMS + MMap was somewhere between 100 and 150 queries/sec, while the RAMDirectory based indices were somewhere between 225 and 250 queries/second. This is an impressive performance gain! It's also interesting because in separately testing RAMDirectory vs MMapDirectory I usually only see minor gains when measuring average query latency.

Plotting the same graph, without CMS + MMapDirectory and removing the 275 queries/second point (since it's over-capacity):

Zing remains incredibly flat at the 99% percentile, while CMS has high response times already at 100 QPS. At 225 queries/sec load, the highest near-capacity rate, for just CMS and Zing on RAMDirectory:

The pause times for CMS are worse than they were at 50 QPS: already at the 95% percentile the response times are too slow (4479 milli-seconds).

Zing works!

It's clear from these tests that Zing really has a very low-pause garbage collector, even at high loads, while managing a 140 GB max heap with 78 GB Lucene index loaded into RAMDirectory. Furthermore, applications can expect a substantial increase in max throughput (around 2X faster in this case) and need not fear using RAMDirectory even for large indices, if they can run under Zing.

Note that Azul has just made the Zing JVM freely available to open-source developers, so now we all can run our own experiments, add Zing into the JVM rotation for our builds, etc.

Next I will test the new DirectPostingsFormat which holds all postings in RAM in simple arrays (no compression) for fast search performance. It requires even more RAM than this test, but gives even faster search performance!

Thank you to Azul for providing a beefy server machine and the Zing JVM to run these tests!

Sunday, July 29, 2012

Building a new Lucene postings format

As of 4.0 Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene's index formats.

A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.

The trickiest format to create is PostingsFormat, which provides read/write access to all postings (fields, terms, documents, frequencies, positions, offsets, payloads). Part of the challenge is that it has a large API surface area. But there are also complexities such as skipping, reuse, conditional use of different values in the enumeration (frequencies, positions, payloads, offsets), partial consumption of the enumeration, etc. These challenges unfortunately make it easy for bugs to sneak in, but an awesome way to ferret out all the bugs is to leverage Lucene's extensive randomized tests: run all tests with -Dtests.postingsformat=XXX (be sure to first register your new postings format). If your new postings format has a bug, tests will most likely fail.

However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others. My favorite is the innocently named TestBasics! Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net our tests are not a great fit for the early iterations when developing a new postings format.

I recently created a new postings format, BlockPostingsFormat, which will hopefully be more efficient than the Sep codec at using fixed int block encodings. I did this to support Han Jiang's Google Summer of Code project to add a useful int block postings format to Lucene.

So, I took the opportunity to address this problem of easier early-stage iterations while developing a new postings format by creating a new test, TestPostingsFormat. It has layers of testing (documents, +freqs, +positions, +payloads, +offsets) that you can incrementally enable as you iterate, as well as different test options (skipping or not, reuse or not, stop visiting documents and/or positions early, one or more threads, etc.). When you turn on verbose (-Dtests.verbose=true) the test prints clear details of everything it indexed and what exactly it's testing so a failure is easy to debug. I'm very happy with the results: I found this to be a much more productive way to create a new postings format.

The goal of this test is to be so thorough that if it passes with your posting format then all Lucene's tests should pass. If ever we find that's not the case then I consider that a bug in TestPostingsFormat! (Who tests the tester?)

If you find yourself creating a new postings format I strongly suggest using the new TestPostingsFormat during early development to get your postings format off the ground. Once it's passing, run all tests with your new postings format, and if something fails please let us know so we can fix TestPostingsFormat.

Tuesday, July 3, 2012

Lucene 4.0.0 alpha, at long last!

The 4.0.0 alpha release of Lucene and Solr is finally out!

This is a major release with lots of great changes. Here I briefly describe the most important Lucene changes, but first the basics:
  • All deprecated APIs as of 3.6.0 have been removed.

  • Pre-3.0 indices are no longer supported.

  • MIGRATE.txt describes how to update your application code.

  • The index format won't change (unless a serious bug fix requires it) between this release and 4.0 GA, but APIs may still change before 4.0.0 beta.

Please try the release and report back!

Pluggable Codec

The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis.

There are some fun core codecs:
  • Lucene40 is the default codec.

  • Lucene3x (read-only) reads any index written with Lucene 3.x.

  • SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!).

  • MemoryPostingsFormat stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (primary key (id) field, date field, etc.)

  • PulsingPostingsFormat inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup.

  • AppendingCodec avoids seeking while writing, necessary for file-systems such as Hadoop DFS.

If you create your own Codec it's easy to confirm all of Lucene/Solr's tests pass with it. If tests fail then likely your Codec has a bug!

A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API.

Flexible scoring

Lucene's scoring is now fully pluggable, with the TF/IDF vector space model remaining as the default. You can create your own scoring model, or use one of the core scoring models (BM25, Divergence from Randomness, Language Models, and Information-based models). Per-document normalization values are no longer limited to a single byte. Various new aggregate statistics are now available.

These changes were part of a 2011 Google Summer of Code project (thank you David!).

These two changes are really important because they remove the barriers to ongoing innovations. Now it's easy to experiment with wild changes to the index format or to Lucene's scoring models. A recent example of such innovation is this neat codec by the devs at Flax to enable updatable fields by storing postings in a Redis key/value store.

Document Values

The new document values API stores strongly typed single-valued fields per document, meant as an eventual replacement for Lucene's field cache. The values are pre-computed during indexing and stored in the index in a column-stride format (values for a single field across all documents are stored together), making it much faster to initialize at search time than the field cache. Values can be fixed 8, 16, 32, 64 bit ints, or variable-bits sized (packed) ints; float or double; and six flavors of byte[] (fixed size or variable sized; dereferenced, straight or sorted).

New Field APIs

The API for creating document fields has changed: Fieldable and AbstractField have been removed, and a new FieldType, factored out of Field class, holds details about how the field's value should be indexed. New classes have been created for specific commonly-used fields:
  • StringField indexes a string as a single token, without norms and as docs only. For example, use this for a primary key (id) field, or for a field you will sort on.

  • TextField indexes the fully tokenized string, with norms and including docs, term frequencies and positions. For example, use this for the primary text field.

  • StoredField is a field whose value is just stored.

  • XXXDocValuesField create typed document values fields.

  • IntField, FloaField, LongField, DoubleField create typed numeric fields for efficient range queries and filters.

If none of these field classes apply, you can always create your own FieldType (typically by starting from the exposed FieldTypes from the above classes and then tweaking), and then construct a Field by passing the name, FieldType and value.

Note that the old APIs (using Index, Store, TermVector enums) are still present (deprecated), to ease migration.

These changes were part of a 2011 Google Summer of Code project (thank you Nikola!).

Other big changes

Lucene's terms are now binary (arbitrary byte[]); by default they are UTF-8 encoded strings, sorted in Unicode sort order. But your Analyzer is free to produce tokens with an arbitrary byte[] (e.g., CollationKeyAnalyzer does so).

A new DirectSpellChecker finds suggestions directly from any Lucene index, avoiding the hassle of maintaining a sidecar spellchecker index. It uses the same fast Levenshtein automata as FuzzyQuery (see below).

Term offsets (the start and end character position of each term) may now be stored in the postings, by using FieldInfo.IndexOption.DOCS_AND_POSITIONS_AND_OFFSETS when indexing the field. I expect this will be useful for fast highlighting without requiring term vectors, but this part is not yet done (patches welcome!).

A new AutomatonQuery matches all documents containing any term matching a provided automaton. Both WildcardQuery and RegexpQuery simply construct the corresponding automaton and then run AutomatonQuery. The classic QueryParser produces a RegexpQuery if you type fieldName:/expression/ or /expression against default field/.


Beyond the fun new features there are some incredible performance gains.

If you use FuzzyQuery, you should see a factor of 100-200 speedup on moderately sized indices.

If you search with a Filter, you can see gains up to 3X faster (depending on filter density and query complexity), thanks to a change that applies filters just like we apply deleted documents.

If you use multiple threads for indexing, you should see stunning throughput gains (265% in that case), thanks to concurrent flushing. You are also now able to use more than 2048 MB IndexWriter RAM buffer (as long as you use multiple threads).

The new BlockTree default terms dictionary uses far less RAM to hold the terms index, and can sometimes avoid going to disk for terms that do not exist. In addition, the field cache also uses substantially less RAM, by avoiding separate objects per document and instead packing character data into shared byte[] blocks. Together this results in a 73% reduction in RAM required for searching in one case.

IndexWriter now buffers term data using byte[] instead of char[], using half the RAM for ASCII terms.

MultiTermQuery now rewrites per-segment, and caches per-term metadata to avoid a second lookup during scoring. This should improve performance though it hasn't been directly tested.

If a BooleanQuery consist of only MUST TermQuery clauses, then a specialized ConjunctionTermScorer is used, giving ~25% speedup.

Reducing merge IO impact

Merging (consolidating many small segments into a single big one) is a very IO and CPU intensive operation which can easily interfere with ongoing searches. In 4.0.0 we now have two ways to reduce this impact:
  • Rate-limit the IO caused by ongoing merging, by calling FSDirectory.setMaxMergeWriteMBPerSec.

  • Use the new NativeUnixDirectory which bypasses the OS's IO cache for all merge IO, by using direct IO. This ensures that a merge won't evict hot pages used by searches. (Note that there is also a native WindowsDirectory, but it does not yet use direct IO during merging... patches welcome!).

Remember to also set swappiness to 0 on Linux if you want to maximize search responsiveness.

More generally, the APIs that open an input or output file (Directory.openInput and Directory.createOutput) now take an IOContext describing what's being done (e.g., flush vs merge), so you can create a custom Directory that changes its behavior depending on the context.

These changes were part of a 2011 Google Summer of Code project (thank you Varun!).

Consolidated modules

The diverse sources, previously scattered between Lucene's and Solr's core and contrib, have been consolidated. Especially noteworthy is the analysis module, providing a rich selection of 48 analyzers across many languages; the queries module, containing function queries (the old core function queries have been removed) and other non-core query classes; and the queryparser module, with numerous query parsers including the classic QueryParser (moved from core).

Other changes

Here's a long list of additional changes:
  • The classic QueryParser now interprets term~N where N is an integer >= 1 as a FuzzyQuery with edit distance N.

  • The field cache normally requires single-valued fields, but we've added FieldCache.getDocTermsOrd which can handle multi-valued fields.

  • Analyzers must always provide a reusable token stream, by implementing the Analyzer.createComponents method (reusableTokenStream has been removed and tokenStream is now final, in Analzyer).

  • IndexReaders are now read-only (you cannot delete document by id, nor change norms) and are strongly typed as AtomicIndexReader or CompositeIndexReader

  • The API for reading term vectors is the same API used for reading all postings, except the term vector API only covers a single document. This is a good match because term vectors are really just a single-document inverted index.

  • Positional queries (PhraseQuery, SpanQuery) will now throw an exception if you run them against a field that did not index positions (previously they silently returned 0 hits).

  • String-based field cache APIs have been replaced with BytesRef based APIs.

  • ParallelMultiSearcher has been absorbed into IndexSearcher as an optional ExecutorService argument to the constructor. Searcher and Searchable have been removed.

  • All serialization code has been removed from Lucene's classes; you must handle serialization at a higher level in your application.

  • Field names are no longer interned, so you cannot rely on == to test for equality (use .equals instead).

  • *SpanFilter has been removed: they created too many objects during searching and were not scalable.

  • Removed IndexSearcher.close: IndexSearcher now only takes a provided IndexReader (no longer a Directory), which is the caller's responsibility to close.

  • You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter.

  • FieldSelector (to only load certain stored fields) has been replaced with a simpler StoredFieldVisitor API.

Happy searching!

Monday, May 14, 2012

Finite State Automata in Lucene

Lucene Revolution 2012 is now done, and the talk Robert and I gave went well! We showed how we are using automata (FSAs and FSTs) to make great improvements throughout Lucene.

You can view the slides here.

This was the first time I used Google Docs exclusively for a talk, and I was impressed! The real-time collaboration was awesome: we each could see the edits the other was doing, live. You never have to "save" your document: instead, every time you make a change, the document is saved to a new revision and you can then use infinite undo, or step back through all revisions, to go back.

Finally, Google Docs covers the whole life-cycle of your talk: editing/iterating, presenting (it presents in full-screen just fine, but does require an internet connection; I exported to PDF ahead of time as a backup) and, finally, sharing with the rest of the world!

Monday, April 30, 2012

Lucene's TokenStreams are actually graphs!

Lucene's TokenStream class produces the sequence of tokens to be indexed for a document's fields. The API is an iterator: you call incrementToken to advance to the next token, and then query specific attributes to obtain the details for that token. For example, CharTermAttribute holds the text of the token; OffsetAttribute has the character start and end offset into the original string corresponding to this token, for highlighting purposes. There are a number of standard token attributes, and some tokenizers add their own attributes.

The TokenStream is actually a chain, starting with a Tokenizer that splits characters into initial tokens, followed by any number of TokenFilters that modify the tokens. You can also use a CharFilter to pre-process the characters before tokenization, for example to strip out HTML markup, remap character sequences or replace characters according to a regular expression, while preserving the proper offsets back into the original input string. Analyzer is the factory class that creates TokenStreams when needed.

Lucene and Solr have a wide variety of Tokenizers and TokenFilters, including support for at least 34 languages.

Let's tokenize a simple example: fast wi fi network is down. Assume we preserve stop words. When viewed as a graph, the tokens look like this:

Each node is a position, and each arc is a token. The TokenStream enumerates a directed acyclic graph, one arc at a time.

Next, let's add SynoynmFilter into our analysis chain, applying these synonyms:
  • fastspeedy
  • wi fiwifi
  • wi fi networkhotspot
resulting in this graph:

Now the graph is more interesting! For each token (arc), the PositionIncrementAttribute tells us how many positions (nodes) ahead this arc starts from, while the new (as of 3.6.0) PositionLengthAttribute tells us how many positions (nodes) ahead the arc arrives to.

Besides SynonymFilter, several other analysis components now produce token graphs. Kuromoji's JapaneseTokenizer outputs the decompounded form for compound tokens. For example, tokens like ショッピングセンター (shopping center) will also have an alternate path with ショッピング (shopping) followed by センター (center). Both ShingleFilter and CommonGramsFilter set the position length to 2 when they merge two input tokens.

Other analysis components should produce a graph but don't yet (patches welcome!): WordDelimiterFilter, DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter, NGramTokenFilter, EdgeNGramTokenFilter, and likely others.


There are unfortunately several hard-to-fix problems with token graphs. One problem is that the indexer completely ignores PositionLengthAttribute; it only pays attention to PositionIncrementAttribute. This means the indexer acts as if all arcs always arrive at the very next position, so for the above graph we actually index this:

This means certain phrase queries should match but don't (e.g.: "hotspot is down"), and other phrase queries shouldn't match but do (e.g.: "fast hotspot fi"). Other cases do work correctly (e.g.: "fast hotspot"). We refer to this "lossy serialization" as sausagization, because the incoming graph is unexpectedly turned from a correct word lattice into an incorrect sausage. This limitation is challenging to fix: it requires changing the index format (and Codec APIs) to store an additional int position length per position, and then fixing positional queries to respect this value.

QueryParser also ignores position length, however this should be easier to fix. This would mean you can run graph analyzers at query time (i.e., query time expansion) and get the correct results.

Another problem is that SynonymFilter also unexpectedly performs its own form of sausagization when the injected synonym is more than one token. For example if you have this rule:
  • dnsdomain name service
it results in graphs like this:

Notice how name was overlapped onto is, and service was overlapped onto up. It's an odd word salad!

This of course also messes up phrase queries ("domain name service is up" should match but doesn't, while "dns name up" shouldn't match but does). To work around this problem you should ensure all of your injected synonyms are single tokens! For this case, you could run the reverse mapping (domain name servicedns) at query time (as well as indexing time) and then both queries dns and domain name service will match any document containing either variant.

This happens because SynonymFilter never creates new positions; if it did so, it could make new positions for tokens in domain name service, and then change dns to position length 3.

Another problem is that SynonymFilter, like the indexer, also ignores the position length of the incoming tokens: it cannot properly consume a token graph. So if you added a second SynonymFilter it would fail to match hotspot is down.

We've only just started but bit by bit our token streams are producing graphs!

Saturday, April 28, 2012

Lucene has two Google Summer of Code students!

I'm happy to announce that two Lucene Google Summer of Code projects were accepted for this summer!

The first project (LUCENE-3312), proposed by Nikola Tanković, will separate StorableField out of IndexableField, and also fix the longstanding confusing trap that one can retrieve a Document at search time and re-index it, without losing anything. It's unfortunately not true!

The second project (LUCENE-3892), proposed by Han Jiang, will create a postings format using PForDelta compression (PDF). Long ago we saw great performance from PForDelta, but with a hacked up prototype patch that couldn't be committed. The goal of this project is to implement PForDelta "for real" and characterize the resulting performance.

Welcome aboard Nikola and Han!

Wednesday, March 14, 2012

New index statistics in Lucene 4.0

In the past, Lucene recorded only the bare minimal aggregate index statistics necessary to support its hard-wired classic vector space scoring model.

Fortunately, this situation is wildly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, Language models, Divergence from Randomness models and Information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.

To understand the new statistics, let's pretend we've indexed the following two example documents, each with only one field "title":
  • document 1: The Lion, the Witch, and the Wardrobe
  • document 2: The Da Vinci Code
Assume we tokenize on whitespace, commas are removed, all terms are downcased and we don't discard stop-words. Here are the statistics Lucene tracks:
How many documents contain at least one occurrence of the term in the field; 3.x indices also save this (TermEnum.docFreq()). For term "lion" docFreq is 1, and for term "the" it's 2.

Number of postings, i.e. sum of TermsEnum.docFreq() across all terms in the field. For our example documents this is 9.

Number of occurrences of this term in the field, across all documents. For term "the" it's 4, for term "vinci" it's 1.

Number of term occurrences in the field, across all documents; this is the sum of TermsEnum.totalTermFreq() across all unique terms in the field. For our example documents this is 11.

How many documents have at least one term for this field. In our example documents, this is 2, but if for example one of the documents was missing the title field, it would be 1.

How many unique terms were seen in this field. For our example documents this is 8. Note that this statistic is of limited utility for scoring, because it's only available per-segment and you cannot (efficiently!) compute this across all segments in the index (unless there is only one segment).

Number of unique terms across all fields; this is the sum of Terms.getUniqueTermCount() across all fields. In our example documents this is 8. Note that this is also only available per-segment.

Number of unique fields. For our example documents this is 1; if we also had a body field and an abstract field, it would be 3. Note that this is also only available per-segment.

3.x indices only store TermsEnum.docFreq(), so if you want to experiment with the new scoring models in Lucene 4.0, you should either re-index or upgrade your index using IndexUpgrader. Note that the new scoring models all use the same single-byte norms format, so you can freely switch between them without re-indexing.

In addition to what's stored in the index, there are also these statistics available per-field, per-document while indexing, in the FieldInvertState passed to Similarity.computeNorm method for both 3.x and 4.0:
How many tokens in the document. For document 1 it's 7; for document 2 it's 4.

For this field in this document, how many unique terms are there? For document 1, it's 5; for document 2 it's 4.

What was the count for the most frequent term in this document. For document 1 it's 3 ("the" occurs 3 times); for document 2 it's 1.

In 3.x, if you want to consume these indexing-time statistics, you'll have to save them away yourself (e.g., somehow encoding them into the single-byte norm value). However, since 4.0 uses doc values for norms, you have more freedom to encode these statistics however you'd like. Your custom similarity can then pull from these.

From these available statistics you're now free to derive other commonly used statistics:
  • Average field length across all documents is Terms.getSumTotalTermFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).

  • Average within-document field term frequency is FieldInvertState.length divided by FieldInvertState.uniqueTermCount.

  • Average number of unique terms per field across all documents is Terms.getSumDocFreq() divided by maxDoc (or Terms.getDocCount(field), if not all documents have the field).
Remember that the statistics do not reflect deleted documents, until those documents are merged away; in general this also means that segment merging will alter scores! Similarly, if the field omits term frequencies, then the statistics will not be correct (though they will still be consistent with one another: we will pretend each term occurred once per document).

Monday, March 5, 2012

Crowd-data can find drug and vaccine side effects

The social crowd has proven to be powerful, if you can find some way to harness it: crowd-sourcing can perform tasks and solve collaborative problems, crowd-funding can raise substantial financing.

I suspect crowd-data will similarly become an effective way to create large, realistic databases.

A great application of this is the medical world, where many people post to health forums raising medical problems, possible side effects from drugs and vaccines, etc. Why not collect all such posts to find previously undiscovered problems? In fact, this paper describes just that: the authors extracted the nasty side effects of statin drugs based on posts to online health forums. Similarly, this abstract describes a system that used crowd-data to spot nasty side effects from Singulair, years before the FDA issued a warning. The VAERS database, which gathers parent-reported problems after children receive vaccines, is another example.

Unfortunately the drug safety trials that take place before a drug can be released are not especially trustworthy. Here's a scary quote from that interview:

    When you look at the highest quality medical studies, the odds that a study will favor the use of a new drug are 5.3 times higher for commercially funded studies than for noncommercially funded studies.

And that was 7 years ago! I imagine the situation has only gotten worse.

When a new drug is released, the true, unbiased drug trial begins when millions of guinea-pigs start taking taking it. Crowd-data makes it possible to draw conclusions from that that post-market drug trial.

Of course there are challenging tradeoffs: crowd-data, being derived from "ordinary people" without any rigorous standard collection process, can be dirty, incomplete and reflect sampling bias (only people experiencing nasty side effects speak up). For these reasons, old-fashioned journals turn their noses up at papers drawing conclusions from crowd-data.

Nevertheless, I believe such limitations are more than offset by the real-time nature and shear scale the millions of people, constantly posting information over time. Inevitably, trustworthy patterns will emerge over the noise. Unlike the synthetic drug trial, this data is as real as you can get: sure, perhaps the drug seemed fine in the carefully controlled pre-market testing, but then out in the real world, unexpected interactions can suddenly emerge. Crowd-data will enable us to find such cases quickly and reliably, as long as we still have enough willing guinea-pigs!

Fast forward a few years and I expect crowd-data will be an excellent means of drawing conclusions, and will prove more reliable than the company-funded pre-market drug trials.

Thursday, March 1, 2012

Transactional Lucene

Many users don't appreciate the transactional semantics of Lucene's APIs and how this can be useful in search applications. For starters, Lucene implements ACID properties:
  • Atomicity: when you make changes (adding, removing documents) in an IndexWriter session, and then commit, either all (if the commit succeeds) or none (if the commit fails) of your changes will be visible, never something in-between. Some methods have their own atomic behavior: if you call updateDocument, which is implemented as a delete followed by an add, you'll never see the delete without the add, even if you open a near-real-time (NRT) reader or commit from a separate thread. Similarly if you add a block of documents, using the relatively new addDocuments method, you'll see either none or all of the documents in any reader you obtain.

  • Consistency: if the computer or OS crashes, or the JVM crashes or is killed, or power is lost, your index will remain intact (ie, not corrupt). Note that other problems, such as bad RAM, a bit-flipping CPU or file system corruption, can still easily corrupt the index!

  • Isolation: while IndexWriter is making changes, nothing is visible to any IndexReader searching the index, until you commit or open a new NRT reader. Only one IndexWriter instance at a time can change the index.

  • Durability: once commit returns, all changes have been written to durable storage (assuming your I/O system correctly implements fsync). If the computer or OS crashes, or the JVM crashes or is killed, or power is lost to the computer, all changes will still be present in the index.

Lucene provides a two-phased commit API: call the prepareCommit method to do all of the hard work (applying buffered deletes, writing buffered documents, fsyncing files). If something is going to go wrong (e.g., disk fills up) it'll almost certainly happen during this first phase. Then, call commit to complete the transaction.

When you close the IndexWriter, it calls commit under the hood. If, instead, you want to discard all changes since the last commit, call the rollback method instead, which also closes the writer. You can even rollback a CREATE: if you have an existing index, and you open an IndexWriter on it with OpenMode.CREATE, and then rollback, the index will be unchanged. Likewise, if you call deleteAll and then rollback.

Note that merely opening an IndexWriter on a new directory does not create an empty commit; ie, you cannot open an IndexReader on the directory until you've called commit yourself.

Lucene does not implement a transaction log itself, but it's easy to build that layer out on top. For example, popular search servers such as Solr and ElasticSearch, do so.

Multiple commits in one index

A single Lucene index is free to contain more than one commit; this is a powerful yet often overlooked feature. Each commit holds a point-in-time view of the index as it existed when the commit was created.

This is similar to the snapshots and writable clones available in modern filesystems like ZFS and the up-and-coming Btrfs. In fact, Lucene is able to efficiently expose multiple commits for the very same underlying reason: all index segments and files are write-once, just like the file blocks in ZFS and Btrfs.

To save multiple commits in your index, just implement your own IndexDeletionPolicy and pass it to IndexWriter. This is the class Lucene uses to know which commits should be deleted: IndexWriter invokes it on opening an index and whenever a commit succeeds. The default policy, KeepOnlyLastCommitDeletionPolicy, deletes all but the last commit. If you use NoDeletionPolicy then every commit is retained!

You can pass a userData (Map<String,String>) to commit, to record custom information (opaque to Lucene) about that commit, and then use IndexReader.listCommits to find all commits in the index. Once you've found a commit, you can open an IndexReader on it to search the index as of that commit.

You can also open an IndexWriter on a prior commit, to effectively roll back all changes after it: this is just like the rollback method, except it enables you to rollback across commits and not just the changes made in the current IndexWriter session.

Old commits are still kept even when you open an index with OpenMode.CREATE. It's also fine to pass OpenMode.CREATE when IndexReaders are still searching the old commits. This enables fun use cases, such as fully re-indexing your content between each commit without affecting any open readers.

Combining all of these fun transactional features, you can do some cool things:

  • Hot backups, using SnapshotDeletionPolicy or PersistentSnapshotDeletionPolicy: these deletion policies make it trivial to take a "live" backup of the index without blocking ongoing changes with IndexWriter. The backup can easily be incremental (just copy the new files, remove the deleted ones), and you can freely throttle the IO to minimize any interference with searching.

  • Searching different catalog versions: perhaps you run an e-commerce site, and but you ship multiple versions of your catalog. In this case you can keep older commits around, each searching a specific version of your catalog, enabling users to choose which catalog to search.

  • Repeatable indexing tests from the same initial index: maybe you want to run a bunch of performance tests, perhaps trying different RAM buffer sizes or merge factors, starting from a large initial index. To do this, simply run each test, but in the end, instead of closing the IndexWriter, use the rollback method to quickly return the index to its initial state, ready for the next test.

  • Force all index segments to be merged down to a single segment, but also keep the prior multi-segment commit. Then you can do tests to compare multi-segment vs single-segment performance.

  • Indexing and searching over the NFS file system: because NFS does not protect still-open files from deletion, you must use an IndexDeletionPolicy to keep each commit around until all open readers have finished with the commit (ie, reopened to a newer commit). The simple approach is time-based, for example: don't delete the commit until it is 15 minutes old, and then always reopen your readers every 5 minutes. Without this you'll hit all sorts of scary exceptions when searching over NFS.

  • Distributed commit: if you have other resources that must commit atomically along with the changes to your Lucene index, you can use the two-phased commit API. This is simple, but vulnerable to failures during the 2nd phaes; to also recover from such cases, for example if Lucene completed its 2nd phase commit but the database's 2nd phase hit some error or crash or power loss, you can easily rollback Lucene's commit by opening an IndexWriter on the prior commit.

  • Experimental index changes: maybe you want to try re-indexing some subset of your index in a new way, but you're not sure it'll work out. In this case, just keep the old commit around, and then rollback if it didn't work out, or delete the old commit if it did.

  • Time-based snapshots: maybe you'd like the freedom to roll back to your index as it existed 1 day ago, 1 week ago, 1 month ago, etc., so you preserve commits based on their age.
Remember that keeping more than one commit alive will necessarily consume additional disk space, however, the overhead is often small since the multiple commits will usually share common segments, especially the larger, older ones.

Saturday, January 14, 2012

ToChildBlockJoinQuery in Lucene

In my last post I described a known limitation of BlockJoinQuery: it joins in only one direction (from child to parent documents). This can be a problem because some applications need to join in reverse (from parent to child documents) instead.

This is now fixed! I just committed a new query, ToChildBlockJoinQuery, to perform the join in the opposite direction. I also renamed the previous query to ToParentBlockJoinQuery.

You use it just like BlockJoinQuery, except in reverse: it wraps any other Query matching parent documents and translates it into a Query matching child documents. The resulting Query can then be combined with other queries against fields in the child documents, and you can then sort by child fields as well.

Using songs and albums as an example: imagine you index each song (child) and album (parent) as separate documents in a single document block. With ToChildBlockJoinQuery, you can now run queries like:
  albumName:thunder AND songName:numb
  albumName:thunder, sort by songTitle
Any query with constraints against album and/or song fields will work, and the returned hits will be individual songs (not grouped).

ToChildBlockJoinQuery will be available in Lucene 3.6.0 and 4.0.

Sunday, January 8, 2012

Searching relational content with Lucene's BlockJoinQuery

Lucene's 3.4.0 release adds a new feature called index-time join (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of relational content.

Most search engines can't directly index relational content, as documents in the index logically behave like a single flat database table. Yet, relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have separate list of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files.

Perhaps the PDF documents you need to search are immense, so you break them up and index each section as a separate Lucene document; in this case you'll have common fields (title, abstract, author, date published, etc.) for the overall document, joined to the sub-document (section) with its own fields (text, page number, etc.). XML documents typically contain nested tags, representing joined sub-documents; emails have attachments; office documents can embed other documents. Nearly all search domains have some form of relational content, often requiring more than one join.

If such content is so common then how do search applications handle it today?

One obvious "solution" is to simply use a relational database instead of a search engine! If relevance scores are less important and you need to do substantial joining, grouping, sorting, etc., then using a database could be best overall. Most databases include some form a text search, some even using Lucene.

If you still want to use a search engine, then one common approach is to denormalize the content up front, at index-time, by joining all tables and indexing the resulting rows, duplicating content in the process. For example, you'd index each song as a Lucene document, copying over all fields from the song's joined album and artist/band. This works correctly, but can be horribly wasteful as you are indexing identical fields, possibly including large text fields, over and over.

Another approach is to do the join yourself, outside of Lucene, by indexing songs, albums and artist/band as separate Lucene documents, perhaps even in separate indices. At search-time, you first run a query against one collection, for example the songs. Then you iterate through all hits, gathering up (joining) the full set of corresponding albums and then run a second query against the albums, with a large OR'd list of the albums from the first query, repeating this process if you need to join to artist/band as well. This approach will also work, but doesn't scale well as you may have to create possibly immense follow-on queries.

Yet another approach is to use a software package that has already implemented one of these approaches for you! elasticsearch, Apache Solr, Apache Jackrabbit, Hibernate Search and many others all handle relational content in some way.

With BlockJoinQuery you can now directly search relational content yourself!

Let's work through a simple example: imagine you sell shirts online. Each shirt has certain common fields such as name, description, fabric, price, etc. For each shirt you have a number of separate stock keeping units or SKUs, which have their own fields like size, color, inventory count, etc. The SKUs are what you actually sell, and what you must stock, because when someone buys a shirt they buy a specific SKU (size and color).

Maybe you are lucky enough to sell the incredible Mountain Three-wolf Moon Short Sleeve Tee, with these SKUs (size, color):
  • small, blue
  • small, black
  • medium, black
  • large, gray
Perhaps a user first searches for "wolf shirt", gets a bunch of hits, and then drills down on a particular size and color, resulting in this query:
   name:wolf AND size=small AND color=blue
which should match this shirt. name is a shirt field while the size and color are SKU fields.

But if the user drills down instead on a small gray shirt:
   name:wolf AND size=small AND color=gray
then this shirt should not match because the small size only comes in blue and black.

How can you run these queries using BlockJoinQuery? Start by indexing each shirt (parent) and all of its SKUs (children) as separate documents, using the new IndexWriter.addDocuments API to add one shirt and all of its SKUs as a single document block. This method atomically adds a block of documents into a single segment as adjacent document IDs, which BlockJoinQuery relies on. You should also add a marker field to each shirt document (e.g. type = shirt), as BlockJoinQuery requires a Filter identifying the parent documents.

To run a BlockJoinQuery at search-time, you'll first need to create the parent filter, matching only shirts. Note that the filter must use FixedBitSet under the hood, like CachingWrapperFilter:
  Filter shirts = new CachingWrapperFilter(
                    new QueryWrapperFilter(
                      new TermQuery(
                        new Term("type", "shirt"))));
Create this filter once, up front and re-use it any time you need to perform this join.

Then, for each query that requires a join, because it involves both SKU and shirt fields, start with the child query matching only SKU fields:
  BooleanQuery skuQuery = new BooleanQuery();
  skuQuery.add(new TermQuery(new Term("size", "small")), Occur.MUST);
  skuQuery.add(new TermQuery(new Term("color", "blue")), Occur.MUST);
Next, use BlockJoinQuery to translate hits from the SKU document space up to the shirt document space:
  BlockJoinQuery skuJoinQuery = new BlockJoinQuery(
The ScoreMode enum decides how scores for multiple SKU hits should be aggregated to the score for the corresponding shirt hit. In this query you don't need scores from the SKU matches, but if you did you can aggregate with Avg, Max or Total instead.

Finally you are now free to build up an arbitrary shirt query using skuJoinQuery as a clause:
  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("name", "wolf")), Occur.MUST);
  query.add(skuJoinQuery, Occur.MUST);
You could also just run skuJoinQuery as-is if the query doesn't have any shirt fields.

Finally, just run this query like normal! The returned hits will be only shirt documents; if you'd also like to see which SKUs matched for each shirt, use BlockJoinCollector:
  BlockJoinCollector c = new BlockJoinCollector(
    Sort.RELEVANCE, // sort
    10,             // numHits
    true,           // trackScores
    false           // trackMaxScore
    );, c);
The provided Sort must use only shirt fields (you cannot sort by any SKU fields). When each hit (a shirt) is competitive, this collector will also record all SKUs that matched for that shirt, which you can retrieve like this:
  TopGroups hits = c.getTopGroups(
    0,   // offset
    10,  // maxDocsPerGroup
    0,   // withinGroupOffset
    true // fillSortFields
Set skuSort to the sort order for the SKUs within each shirt. The first offset hits are skipped (use this for paging through shirt hits). Under each shirt, at most maxDocsPerGroup SKUs will be returned. Use withinGroupOffset if you want to page within the SKUs. If fillSortFields is true then each SKU hit will have values for the fields from skuSort.

The hits returned by BlockJoinCollector.getTopGroups are SKU hits, grouped by shirt. You'd get the exact same results if you had denormalized up-front and then used grouping to group results by shirt.

You can also do more than one join in a single query; the joins can be nested (parent to child to grandchild) or parallel (parent to child1 and parent to child2).

However, there are some important limitations of index-time joins:
  • The join must be computed at index-time and "compiled" into the index, in that all joined child documents must be indexed along with the parent document, as a single document block.

  • Different document types (for example, shirts and SKUs) must share a single index, which is wasteful as it means non-sparse data structures like FieldCache entries consume more memory than they would if you had separate indices.

  • If you need to re-index a parent document or any of its child documents, or delete or add a child, then the entire block must be re-indexed. This is a big problem in some cases, for example if you index "user reviews" as child documents then whenever a user adds a review you'll have to re-index that shirt as well as all its SKUs and user reviews.

  • There is no QueryParser support, so you need to programmatically create the parent and child queries, separating according to parent and child fields.

  • The join can currently only go in one direction (mapping child docIDs to parent docIDs), but in some cases you need to map parent docIDs to child docIDs. For example, when searching songs, perhaps you want all matching songs sorted by their title. You can't easily do this today because the only way to get song hits is to group by album or band/artist.

  • The join is a one (parent) to many (children), inner join.
As usual, patches are welcome!

There is work underway to create a more flexible, but likely less performant, query-time join capability, which should address a number of the above limitations.