Monday, September 26, 2011

Lucene's SearcherManager simplifies reopen with threads

Modern computers have wonderful hardware concurrency, within and across CPU cores, RAM and IO resources, which means your typical server-based search application should use multiple threads to fully utilize all resources.

For searching, this usually means you'll have one thread handle each search request, sharing a single IndexSearcher instance. This model is effective: the Lucene developers work hard to minimize internal locking in all Lucene classes. In fact, we recently removed thread contention during indexing (specifically, flushing), resulting in massive gains in indexing throughput on highly concurrent hardware.

Since IndexSearcher exposes a fixed, point-in-time view of the index, when you make changes to the index you'll need to reopen it. Fortunately, since version 2.9, Lucene has provided the IndexReader.reopen method to get a new reader reflecting the changes.

This operation is efficient: the new reader shares already-warmed sub-readers with the old reader, so it only opens sub-readers for any newly created segments. This means reopen time is generally proportional to how many changes you made; however, it will be longer when a large merge has just completed. It's best to warm the new reader before putting it into production by running a set of "typical" searches for your application, so that Lucene performs its one-time initialization of internal data structures (norms, field cache, etc.).

But how should you properly reopen, while search threads are still running and new searches are forever arriving? Your search application is popular, users are always searching and there's never a good time to switch! The core issue is that you must never close your old IndexReader while other threads are still using it for searching, otherwise those threads can easily hit cryptic exceptions that often mimic index corruption.

Lucene tries to detect that you've done this, and will throw a nice AlreadyClosedException, but we cannot guarantee that exception is thrown since we only check up front, when the search kicks off: if you close the reader when a search is already underway then all bets are off.

One simple approach would be to temporarily block all new searches and wait for all running searches to complete, and then close the old reader and switch to the new one. This is how janitors often clean a bathroom: they wait for all current users to finish and block new users with the all-too-familiar plastic yellow sign.

While the bathroom cleaning approach will work, it has an obviously serious drawback: during the cutover you are now forcing your users to wait, and that wait time could be long (the time for the slowest currently running search to finish).

A much better solution is to immediately direct new searches to the new reader, as soon as it's done warming, and then separately wait for the still-running searches against the old reader to complete. Once the very last search has finished with the old reader, close it.

This solution is fully concurrent: it has no locking whatsoever so searches are never blocked, as long as you use a separate thread to perform the reopen and warming. The time to reopen and warm the new reader has no impact on ongoing searches, except to the extent that reopen consumes CPU, RAM and IO resources to do its job (and, sometimes, this can in fact interfere with ongoing searches).

So how exactly do you implement this approach? The simplest way is to use the reference counting APIs already provided by IndexReader to track how many threads are currently using each searcher. Fortunately, as of Lucene 3.5.0, there will be a new contrib/misc utility class, SearcherManager, originally created as an example for Lucene in Action, 2nd edition, that does this for you! (LUCENE-3445 has the details.)

The class is easy to use. You first create it, by providing the Directory holding your index and a SearchWarmer instance:

  class MySearchWarmer implements SearchWarmer {
    @Override
    public void warm(IndexSearcher searcher) throws IOException {
      // Run some diverse searches, searching and sorting against all
      // fields that are used by your application
    }
  }

  Directory dir = FSDirectory.open(new File("/path/to/index"));
  SearcherManager mgr = new SearcherManager(dir,
                                            new MySearchWarmer());

Then, for each search request:

  IndexSearcher searcher = mgr.acquire();
  try {
    // Do your search, including loading any documents, etc.
  } finally {
    mgr.release(searcher);

    // Set to null to ensure we never again try to use
    // this searcher instance after releasing:
    searcher = null;
  }

Be sure you are fully done with the searcher before releasing it! A common mistake is to release it and then later accidentally use it again, for example to load stored documents when rendering the search results for the current page.

Finally, you'll need to periodically call the maybeReopen method from a separate (i.e., non-searching) thread. This method reopens the reader and cuts over to it only if there was actually a change. If your application knows when changes have been committed to the index, you can reopen right after that. Otherwise, you can simply call maybeReopen every X seconds. When there has been no change to the index, the cost of maybeReopen is negligible, so calling it frequently is fine.
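
For example, a dedicated reopen thread might look something like this (a minimal sketch, not part of SearcherManager itself; the stop flag and the one-second interval are assumptions for illustration):

  Thread reopenThread = new Thread() {
    @Override
    public void run() {
      try {
        // 'stop' is an assumed volatile boolean your shutdown code sets
        while (!stop) {
          mgr.maybeReopen();    // near zero cost if nothing has changed
          Thread.sleep(1000);   // check roughly once per second
        }
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  };
  reopenThread.setDaemon(true);
  reopenThread.start();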

Beware the potentially high transient cost of reopen and warm! During reopen you must have two readers open until the old one can be closed, so you should budget plenty of RAM in the computer and heap for the JVM to comfortably handle the worst case, when the two readers share no sub-readers (for example, after a full optimize) and thus consume 2X the RAM of a single reader. Otherwise you might hit a swap storm or OutOfMemoryError, effectively taking down your entire search application. Worse, you won't see this problem early on: your first few hundred reopens could easily use only small amounts of added heap, but then suddenly on some unexpected reopen the cost is far higher. Reopening and warming is also generally IO intensive as the reader must load certain index data structures into memory.

Next time I'll describe another utility class, NRTManager, available since version 3.3.0, that you should use instead if your application uses Lucene's fast-turnaround near-real-time (NRT) search. This class solves the same problem (thread-safety during reopening) as SearcherManager but adds a fun twist as it gives you more specific control over which changes must be visible in the newly opened reader.

Thursday, June 30, 2011

Primary key lookups are 2.8X faster with MemoryCodec

A few days ago I committed the new MemoryCodec to Lucene's trunk (to be 4.0). This codec indexes all terms and postings into a compact finite-state transducer (FST) and then, at search time, avoids I/O by performing all terms and postings enumerations in memory using the FST.

If your application needs fast primary-key lookups, and you can afford the required additional memory, this codec might be a good match for the id field. To test this, I switched Lucene's nightly benchmark to use MemoryCodec (just for its id field), and performance jumped from around 179 K to 509 K lookups per second:



This is an awesome improvement! It's particularly impressive as the id field was previously indexed using PulsingCodec, which was already faster than the default StandardCodec.

This is the performance for a single thread, and should scale up linearly if you use multiple threads. Each lookup resolves 4,000 keys in order at once from the id field, performing the lookups segment by segment for best performance (see the source code). The index has 27.6 M docs across multiple segments.

Of course, there is added memory required, specifically 188 MB for this index, which works out to 7.1 bytes per document on average.

There are two sources of MemoryCodec's gains. First, the obvious one: since everything is in memory, you never wait for an I/O seek operation, as long as you ensure the sneaky OS never swaps out your process memory.

Second, I separately added a new seekExact API to TermsEnum, enabling codecs to save CPU if the caller does not need to know the following term when the target term doesn't exist, as is the case here. MemoryCodec has an optimized implementation for seekExact (and so does the cool SimpleTextCodec!). Eventually other codecs should as well, by using the block tree terms index, but we're not there yet.

The id field in the nightly benchmark omits term freq and positions; however, MemoryCodec is fully general: you can use it for any field (not just primary-key fields), storing positions, payloads, etc. Also, the id field's values are zero-padded sequential integers (00000001, 00000002, 00000003, etc.), which is likely important for performance as it allows maximal sharing in the FST. I haven't tested, but I suspect that had I used something more random, such as GUIDs, memory usage would be higher and lookup performance worse, as each segment's FST would be less dense (share less).

Of course, Lucene is not a database, and you normally use it for its fast search performance, not primary-key lookups. The one common search use case where you do require primary-key lookups is during indexing, when deleting or updating documents by an id field. Near-realtime search with updates or deletions relies on this, since the deleted documents must be resolved during reopen, so we also see a healthy speedup in the NRT reopen time:



The NRT latency dropped from around 52 milliseconds to 43 milliseconds, a 17% improvement. This is "only" 17% because opening a new reader must also do other things, like flushing the indexed documents as a new segment.

Perhaps more importantly, the variance also dropped substantially, which is expected because with MemoryCodec and NRTCachingDirectory, NRT reopen is fully I/O free (performs no reads or writes when opening a new reader).

One limitation of MemoryCodec is it's an all-or-nothing deal: all terms and postings are in memory, or they aren't. LUCENE-3069, still to be done (any volunteers?), aims to fix this, by enabling you to separately choose whether terms and/or postings data should be in memory.

I suspect an even more specialized codec, for example one that requires the field values to be compact integers, and also requires that the values are unique (only supports primary-key fields), could do even better than MemoryCodec by storing the mapping in global (across all segments) parallel arrays. Such a codec would no longer be general; it'd only work for primary-key fields whose values are compact integers. But it'd have faster lookups than MemoryCodec and should use less memory per document. This codec could simply wrap any other codec, i.e. it would create the arrays on reader initialization, and delegate persisting the postings into the index to the wrapped codec.

Tuesday, June 14, 2011

Near-real-time latency during large merges

I looked into the curious issue I described in my last post, where the NRT reopen delays can become "spikey" (take longer) during a large merge.

To show the issue, I modified the NRT test to kick off a background optimize on startup. This runs a single large merge, creating a 13 GB segment, and indeed produces spikey reopen delays (purple):



The large merge finishes shortly after 7 minutes, after which the reopen delays become healthy again. Search performance (green) is unaffected.

I also added Linux's dirty bytes to the graph, as reported by /proc/meminfo; it's the saw-tooth blue/green series at the bottom. Note that it's divided by 10, to better fit the Y axis; the peaks are around 800-900 MB.

The large merge writes bytes at a fairly high rate (around 30 MB/sec), but Linux buffers those writes in RAM, only actually flushing them to disk every 30 seconds; this is what produces the saw-tooth pattern.

From the graph you can see that the spikey reopen delays generally correlate to when Linux is flushing the dirty pages to disk. Apparently, this heavy write IO interferes with the read IO required when resolving deleted terms to document IDs. To confirm this, I ran the same stress test, but with only adds (no deletions); the reopen delays were then unaffected by the ongoing large merge.

So finally the mystery is explained, but, how to fix it?

I know I could tune Linux's IO, for example to write more frequently, but I'd rather find a Lucene-only solution since we can't expect most users to tune the OS.

One possibility is to make a RAM resident terms dictionary, just for primary-key fields. This could be very compact, for example by using an FST, and should give lookups that never hit disk unless the OS has frustratingly swapped out your RAM data structures. This can also be separately useful for applications that need fast document lookup by primary key, so someone should at some point build this.

Another, lower-level idea is to simply rate-limit the bytes/sec written by merges. Since big merges also impact ongoing searches, this would likely help that case as well. To try this out, I made a simple prototype (see LUCENE-3202), and then re-ran the same stress test, limiting all merging to 10 MB/sec:



The optimize now took 3 times longer, and the peak dirty bytes (around 300 MB) are about 1/3 as large, as expected since the IO write rate is limited to 10 MB/sec. But look at the reopen delays: they are now much better contained, averaging around 70 milliseconds while the optimize is running, and dropping to 60 milliseconds once the optimize finishes. I think the ability to limit merging IO is an important feature for Lucene!
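
The core of that prototype can be sketched as a tiny limiter that pauses before each chunk of merge output so the average write rate stays at or below the configured MB/sec (this is just a sketch of the idea, not the actual LUCENE-3202 patch; the class and method names are made up):

  class SimpleRateLimiter {
    private final double bytesPerSec;
    private long lastNS = System.nanoTime();

    SimpleRateLimiter(double mbPerSec) {
      bytesPerSec = mbPerSec * 1024 * 1024;
    }

    // Called before writing numBytes during a merge; sleeps long enough
    // that, on average, we never exceed bytesPerSec.
    synchronized void pause(int numBytes) throws InterruptedException {
      long targetNS = lastNS + (long) (1000000000.0 * numBytes / bytesPerSec);
      long nowNS = System.nanoTime();
      if (targetNS > nowNS) {
        Thread.sleep((targetNS - nowNS) / 1000000);
      }
      lastNS = Math.max(nowNS, targetNS);
    }
  }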

Tuesday, June 7, 2011

Lucene's near-real-time search is fast!

Lucene's near-real-time (NRT) search feature, available since 2.9, enables an application to make index changes visible to a new searcher with fast turnaround time. In some cases, such as modern social/news sites (e.g., LinkedIn, Twitter, Facebook, Stack Overflow, Hacker News, DZone, etc.), fast turnaround time is a hard requirement.

Fortunately, it's trivial to use. Just open your initial NRT reader, like this:

// w is your IndexWriter
IndexReader r = IndexReader.open(w, true);

(That's the 3.1+ API; prior to that use w.getReader() instead).

The returned reader behaves just like one opened with IndexReader.open: it exposes the point-in-time snapshot of the index as of when it was opened. Wrap it in an IndexSearcher and search away!

Once you've made changes to the index, call r.reopen() and you'll get another NRT reader; just be sure to close the old one.
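
In code, that cycle looks roughly like this (a minimal sketch; reopen returns the same instance when nothing has changed, in which case you must not close it):

  IndexReader newReader = r.reopen();
  if (newReader != r) {
    r.close();        // only safe once no thread is still searching r
    r = newReader;
  }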

What's special about the NRT reader is that it searches uncommitted changes from IndexWriter, enabling your application to decouple fast turnaround time from index durability on crash (i.e., how often commit is called), something not previously possible.

Under the hood, when an NRT reader is opened, Lucene flushes indexed documents as a new segment, applies any buffered deletions to in-memory bit-sets, and then opens a new reader showing the changes. The reopen time is in proportion to how many changes you made since last reopening that reader.

Lucene's approach is a nice compromise between immediate consistency, where changes are visible after each index change, and eventual consistency, where changes are visible "later" but you don't usually know exactly when.

With NRT, your application has controlled consistency: you decide exactly when changes must become visible.

Recently there have been some good improvements related to NRT:
  • New default merge policy, TieredMergePolicy, which is able to select more efficient non-contiguous merges, and favors segments with more deletions.

  • NRTCachingDirectory takes load off the IO system by caching small segments in RAM (LUCENE-3092).

  • When you open an NRT reader you can now optionally specify that deletions do not need to be applied, making reopen faster for those cases that can tolerate temporarily seeing deleted documents returned, or have some other means of filtering them out (LUCENE-2900).

  • Segments that are 100% deleted are now dropped instead of inefficiently merged (LUCENE-2010).

How fast is NRT search?

I created a simple performance test to answer this. I first built a starting index by indexing all of Wikipedia's content (25 GB plain text), broken into 1 KB sized documents.

Using this index, the test then reindexes all the documents again, this time at a fixed rate of 1 MB/second plain text. This is a very fast rate compared to the typical NRT application; for example, it's almost twice as fast as Twitter's recent peak during this year's Super Bowl (4,064 tweets/second, or roughly 0.57 MB/second if every tweet is 140 bytes), assuming Twitter indexed all tweets on a single shard.

The test uses updateDocument, replacing documents by randomly selected ID, so that Lucene is forced to apply deletes across all segments. In addition, 8 search threads run a fixed TermQuery at the same time.

Finally, the NRT reader is reopened once per second.

I ran the test on modern hardware, a 24 core machine (dual x5680 Xeon CPUs) with an OCZ Vertex 3 240 GB SSD, using Oracle's 64 bit Java 1.6.0_21 and Linux Fedora 13. I gave Java a 2 GB max heap, and used MMapDirectory.

The test ran for 6 hours 25 minutes, since that's how long it takes to re-index all of Wikipedia at a limited rate of 1 MB/sec; here's the resulting QPS and NRT reopen delay (milliseconds) over that time:



The search QPS is green and the time to reopen each reader (NRT reopen delay in milliseconds) is blue; the graph is an interactive Dygraph, so if you click through above, you can then zoom in to any interesting region by clicking and dragging. You can also apply smoothing by entering the size of the window into the text box in the bottom left part of the graph.

Search QPS dropped substantially with time. While annoying, this is expected, because of how deletions work in Lucene: documents are merely marked as deleted, and thus are still visited but then filtered out during searching. They are only truly deleted when the segments are merged. TermQuery is a worst-case query; harder queries, such as BooleanQuery, should see less slowdown from deleted, but not yet reclaimed, documents.

Since the starting index had no deletions, and then picked up deletions over time, the QPS dropped. It looks like TieredMergePolicy should perhaps be even more aggressive in targeting segments with deletions; however, finally around 5:40 a very large merge (reclaiming many deletions) was kicked off. Once it finished the QPS recovered somewhat.

Note that a real NRT application with deletions would see a more stable QPS since the index in "steady state" would always have some number of deletions in it; starting from a fresh index with no deletions is not typical.

Reopen delay during merging

The reopen delay is mostly around 55-60 milliseconds (the mean is 57.0), which is very fast (i.e., only a 5.7% "duty cycle" relative to the once-per-second reopen rate). There are random single spikes, which are caused by Java running a full GC cycle. However, large merges can slow down the reopen delay (once around 1:14, again at 3:34, and then the very large merge starting at 5:40). Many small merges (up to a few hundred MB) were done but don't seem to impact reopen delay. Large merges have been a challenge in Lucene for some time, also causing trouble for ongoing searching.

I'm not yet sure why large merges so adversely impact reopen time; there are several possibilities. It could be simple IO contention: a merge keeps the IO system very busy reading and writing many bytes, thus interfering with any IO required during reopen. However, if that were the case, NRTCachingDirectory (used by the test) should have prevented it, but didn't. It's also possible that the OS is [poorly] choosing to evict important process pages, such as the terms index, in favor of IO caching, causing the term lookups required when applying deletes to hit page faults; however, this also shouldn't be happening in my test since I've set Linux's swappiness to 0.

Yet another possibility is that Linux's write cache becomes temporarily too full, thus stalling all IO in the process until it clears; in this case perhaps tuning some of Linux's pdflush tunables could help, although I'd much rather find a Lucene-only solution so this problem can be fixed without users having to tweak such advanced OS tunables (even swappiness).

Fortunately, we have an active Google Summer of Code student, Varun Thacker, working on enabling Directory implementations to pass appropriate flags to the OS when opening files for merging (LUCENE-2793 and LUCENE-2795). From past testing I know that passing O_DIRECT can prevent merges from evicting hot pages, so it's possible this will fix our slow reopen time as well since it bypasses the write cache.

Finally, it's always possible other OSs do a better job managing the buffer cache, and wouldn't see such reopen delays during large merges.

This issue is still a mystery, as there are many possibilities, but we'll eventually get to the bottom of it. It could be we should simply add our own IO throttling, so we can control net MB/sec read and written by merging activity. This would make a nice addition to Lucene!

Except for the slowdown during merging, the performance of NRT is impressive. Most applications will have a required indexing rate far below 1 MB/sec per shard, and for most applications reopening once per second is fast enough.

While there are exciting ideas to bring true real-time search to Lucene, by directly searching IndexWriter's RAM buffer as Michael Busch has implemented at Twitter with some cool custom extensions to Lucene, I doubt even the most demanding social apps actually truly need better performance than we see today with NRT.

NIOFSDirectory vs MMapDirectory

Out of curiosity, I ran the exact same test as above, but this time with NIOFSDirectory instead of MMapDirectory:



There are some interesting differences. The search QPS is substantially slower -- starting at 107 QPS vs 151, though part of this could easily be from getting different compilation out of hotspot. For some reason TermQuery, in particular, has high variance from one JVM instance to another.

The mean reopen time is slower: 67.7 milliseconds vs 57.0, and the reopen time seems more affected by the number of segments in the index (this is the saw-tooth pattern in the graph, matching when minor merges occur). The takeaway message seems clear: on Linux, use MMapDirectory not NIOFSDirectory!

Optimizing your NRT turnaround time

My test was just one datapoint, at a fixed fast reopen period (once per second) and at a high indexing rate (1 MB/sec plain text). You should test specifically for your use-case what reopen rate works best. Generally, the more frequently you reopen the faster the turnaround time will be, since fewer changes need to be applied; however, frequent reopening will reduce the maximum indexing rate.

Most apps have relatively low required indexing rates compared to what Lucene can handle and can thus pick a reopen rate to suit the application's turnaround time requirements.

There are also some simple steps you can take to reduce the turnaround time:
  • Store the index on a fast IO system, ideally a modern SSD.

  • Install a merged segment warmer (see IndexWriter.setMergedSegmentWarmer). This warmer is invoked by IndexWriter to warm up a newly merged segment without blocking the reopen of a new NRT reader. If your application uses Lucene's FieldCache or has its own caches, this is important as otherwise that warming cost will be spent on the first query to hit the new reader. (There is a small sketch after this list.)

  • Use only as many indexing threads as needed to achieve your required indexing rate; often 1 thread suffices. The fewer threads used for indexing, the faster the flushing, and the less merging (on trunk).

  • If you are using Lucene's trunk, and your changes include deleting or updating prior documents, then use the Pulsing codec for your id field since this gives faster lookup performance which will make your reopen faster.

  • Use the new NRTCachingDirectory, which buffers small segments in RAM to take load off the IO system (LUCENE-3092).

  • Pass false for applyDeletes when opening an NRT reader, if your application can tolerate seeing deleted docs from the returned reader.

  • While it's not clear that thread priorities actually work correctly (see this Google Tech Talk), you should still set your thread priorities properly: the thread reopening your readers should be highest; next should be your indexing threads; and finally lowest should be all searching threads. If the machine becomes saturated, ideally only the search threads should take the hit.
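
For example, installing a merged segment warmer might look roughly like this (a sketch; what you do inside warm depends entirely on your application, and in newer releases this setter may live on IndexWriterConfig instead):

  // w is your IndexWriter
  w.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
    @Override
    public void warm(IndexReader reader) throws IOException {
      // Touch whatever your application needs (FieldCache entries, your
      // own caches, etc.) so the first search against an NRT reader that
      // includes this merged segment doesn't pay the warming cost.
    }
  });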

Happy near-real-time searching!

Saturday, May 21, 2011

The invisible Lucene bug fixed point

It turns out, the Jira issue tracking system, which we make heavy use of here at Apache, uses Lucene under the hood for searching and browsing issues. This is wonderful since it means Lucene developers are eating their own dog food whenever they use Jira.

Atlassian has opened up some doozy bugs over time, including one of the earliest bug numbers I've ever worked on, LUCENE-140. They sent me a t-shirt for fixing that one (thank you!).

Now, imagine this: what if there were a sneaky bug in Lucene, say a certain text fragment that causes an exception during indexing. A user opens an issue to report this, including the problematic text fragment, yet, because Jira uses Lucene, it hits an exception while indexing that fragment and causes this one bug to be un-searchable and un-viewable when browsing! An invisible bug fixed point.

It's somewhat mind bending to think about, Lucene recursing on itself through Jira, yet it's theoretically possible! Maybe we have a few of invisible bug fixed points lurking already and nobody knows...

Saturday, May 7, 2011

265% indexing speedup with Lucene's concurrent flushing

A week ago, I described the nightly benchmarks we use to catch any unexpected slowdowns in Lucene's performance. Back then the graphs were rather boring (a good thing), but, not anymore! Have a look at the stunning jumps in Lucene's indexing rate:



(Click through the image to see details about what changed on dates A, B, C and D).

Previously we were around 102 GB of plain text per hour, and now it's about 270 GB/hour. That's a 265% jump! Lucene now indexes all of Wikipedia's 23.2 GB (English) export in 5 minutes and 10 seconds.

How did this happen? Concurrent flushing.

That new feature, having lived on a branch for quite some time, undergoing many fun iterations, was finally merged back to trunk about a week ago.

Before concurrent flushing, whenever IndexWriter needed to flush a new segment, it would stop all indexing threads and hijack one thread to perform the rather compute intensive flush. This was a nasty bottleneck on computers with highly concurrent hardware; flushing was inherently single threaded. I previously described the problem here.

But with concurrent flushing, each thread freely flushes its own segment even while other threads continue indexing. No more bottleneck!

Note that there are two separate jumps in the graph. The first jump, the day concurrent flushing landed (labelled as B on the graph), shows the improvement while using only 6 threads and 512 MB RAM buffer during indexing. Those settings resulted in the fastest indexing rate before concurrent flushing.

The second jump (labelled as D on the graph) happened when I increased the indexing threads to 20 and dropped the RAM buffer to 350 MB, giving the fastest indexing rate after concurrent flushing.

One nice side effect of concurrent flushing is that you can now use RAM buffers well over 2.1 GB, as long as you use multiple threads. Curiously, I found that larger RAM buffers slow down overall indexing rate. This might be because of the discontinuity when closing IndexWriter, when we must wait for all the RAM buffers to be written to disk. It would be better to measure steady state indexing rate, while indexing an effectively infinite content source, and ignoring the startup and ending transients; I suspect if I measured that instead, we'd see gains from larger RAM buffers, but this is just speculation at this point.

There were some very challenging changes required to make concurrent flushing work, especially around how IndexWriter handles buffered deletes. Simon Willnauer does a great job describing these changes here and here. Concurrency is tricky!

Remember this change only helps you if you have concurrent hardware, you use enough threads for indexing and there's no other bottleneck (for example, in the content source that provides the documents). Also, if your IO system can't keep up then it will bottleneck your CPU concurrency. The nightly benchmark runs on a computer with 12 real (24 with hyperthreading) cores and a fast (OCZ Vertex 3) solid-state disk. Finally, this feature is not yet released: it was committed to Lucene's trunk, which will eventually be released as 4.0.

Friday, April 29, 2011

Catching slowdowns in Lucene

Lucene has great randomized tests to catch functional failures, but when we accidentally commit a performance regression (we slow down indexing or searching), nothing catches us!

This is scary, because we want things to get only faster with time.

So, when there's a core change that we think may impact performance, we run before/after tests to verify. But this is ad-hoc and error-prone: we could easily forget to do this, or fail to anticipate that a code change might have a performance impact.

Even when we do test the performance of a change, the slowdown could be relatively small, easily hiding within the unfortunately often substantial noise of our tests. Over time we might accumulate many such small, unmeasurable slowdowns, suffering the fate of the boiling frog. We also run performance tests before releasing, but it's better to catch slowdowns sooner: solving them just before releasing is... dangerous.

To address this problem, I've created a script that runs standard benchmarks on Lucene's trunk (to be 4.0), nightly. It indexes all of Wikipedia's English XML export, three times (with different settings and document sizes), runs a near-real-time (NRT) turnaround time test for 30 minutes, and finally a diverse set of hard queries.

This has been running for a few weeks now, and the results are accessible to anyone.

It's wonderful to see that Lucene's indexing throughput is already a bit faster (~98 GB plain text per hour) than when I last measured!

Near-real-time reopen latency is here; the test measures how long it takes (on average, after discarding outliers) to open a new NRT reader. It's quite intensive, indexing around 1 MB plain text per second as updates (delete+addDocument), and reopening once per second, on the full previously built Wikipedia index.

To put this in perspective, that's almost twice Twitter's recent peak indexing rate during the 2011 Super Bowl (4,064 Tweets/second), although Twitter's use-case is harder because the documents are much smaller, and presumably there's additional indexed metadata beyond just the text of the Tweet. Twitter has actually implemented some cool changes to Lucene to enable real-time searching without reopening readers; Michael Busch describes them here and here. Some day I hope these will be folded into Lucene!

Finally, we test all sorts of queries: PhraseQuery (exact and sloppy), FuzzyQuery (edit distance 1 and 2), four variants of BooleanQuery, NumericRangeQuery, PrefixQuery, WildcardQuery, SpanNearQuery, and of course TermQuery. In addition we test the automaton spell checker, and primary-key lookup.

A few days ago, I switched all tests to the very fast 240 GB OCZ Vertex 3 (previously it was a traditional spinning-magnets hard drive). It looks like indexing throughput gained a bit of performance (~102 GB plain text per hour), the search performance was unaffected (expected, because for this test all postings easily fit in available RAM), but the NRT turnaround time saw a drastic reduction in noise, to near-zero. NRT is very IO intensive, so it makes sense that a fast IO system improves its turnaround time; I need to dig further into this.

Unfortunately, performance results are inherently noisy. For example you can see the large noise (the error band is +/- one standard deviation) in the TermQuery results; other queries seem to have less noise for some reason.

So far the graphs are rather boring: nice and flat. This is a good thing!

Sunday, April 24, 2011

Just say no to swapping!

Imagine you love to cook; it's an intense hobby of yours. Over time, you've accumulated many fun spices, but your pantry is too small, so, you rent an off-site storage facility, and move the less frequently used spice racks there. Problem solved!

Suddenly you decide to cook this great new recipe. You head to the pantry to retrieve your Saffron, but it's not there! It was moved out to the storage facility and must now be retrieved (this is a hard page fault).

No problem -- your neighbor volunteers to go fetch it for you. Unfortunately, the facility is ~2,900 miles away, all the way across the US, so it takes your friend 6 days to retrieve it!

This assumes you normally take 7 seconds to retrieve a spice from the pantry; that your data was in main memory (~100 nanoseconds access time), not in the CPU's caches (which'd be maybe 10 nanoseconds); that your swap file is on a fast (say, WD Raptor) spinning-magnets hard drive with 5 millisecond average access time; and that your neighbor drives non-stop at 60 mph to the facility and back.

Even worse, your neighbor drives a motorcycle, and so he can only retrieve one spice rack at a time. So, after waiting 6 days for the Saffron to come back, when you next go to the pantry to get some Paprika, it's also "swapped out" and you must wait another 6 days! It's possible that first spice rack also happened to have the Paprika but it's also likely it did not; that depends on your spice locality. Also, with each trip, your neighbor must pick a spice rack to move out to the facility, so that the returned spice rack has a place to go (it is a "swap", after all), so the Paprika could have just been swapped out!

Sadly, it might easily be many weeks until you succeed in cooking your dish.

Maybe in the olden days, when memory itself was a core of little magnets, swapping cost wasn't so extreme, but these days, as memory access time has improved drastically while hard drive access time hasn't budged, the disparity is now unacceptable. Swapping has become a badly leaking abstraction. When a typical process (say, your e-mail reader) has to "swap back in" after not being used for a while, it can hit 100s of such page faults, before finishing redrawing its window. It's an awful experience, though it has the fun side effect of letting you see, in slow motion, just what precise steps your email reader goes through when redrawing its window.

Swapping is especially disastrous with JVM processes. See, the JVM generally won't do a full GC cycle until it has run out of its allowed heap, so most of your heap is likely occupied by not-yet-collected garbage. Since these pages aren't being touched (because they are garbage and thus unreferenced), the OS happily swaps them out. When GC finally runs, you have a ridiculous swap storm, pulling in all these pages only to then discover that they are in fact filled with garbage and should be discarded; this can easily make your GC cycle take many minutes!

It'd be better if the JVM could work more closely with the OS so that GC would somehow run on-demand whenever the OS wants to start swapping so that, at least, we never swap out garbage. Until then, make sure you don't set your JVM's heap size too large!


Just use an SSD...

These days, many machines ship with solid state disks, which are an astounding (though still costly) improvement over spinning magnets; once you've used an SSD you can never go back; it's just one of life's many one-way doors.

You might be tempted to declare that this problem is solved, since SSDs are so blazingly fast, right? Indeed, they are orders of magnitude faster than spinning magnets, but they are still 2-3 orders of magnitude slower than main memory or CPU cache. The typical SSD might have 50 microseconds access time, which equates to ~58 total miles of driving at 60 mph. Certainly a huge improvement, but still unacceptable if you want to cook your dish on time!


Just add RAM...

Another common workaround is to put lots of RAM in your machine, but this can easily back-fire: operating systems will happily swap out memory pages in favor of caching IO pages, so if you have any processes accessing lots of bytes (say, mencoder encoding a 50 GB bluray movie, maybe a virus checker or backup program, or even Lucene searching against a large index or doing a large merge), the OS will swap your pages out. This then means that the more RAM you have, the more swapping you get, and the problem only gets worse!

Fortunately, some OS's let you control this behavior: on Linux, you can tune swappiness down to 0 (most Linux distros default this to a highish number); Windows also has a checkbox, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache, that's likely doing something similar.

There are low-level IO flags that these programs are supposed to use so that the OS knows not to cache the pages they access, but sometimes the processes fail to use them or cannot use them (for example, they are not yet exposed to Java), and even if they do, sometimes the OS ignores them!


When swapping is OK

If your computer never runs any interactive processes, i.e., a process where a human is blocked (waiting) on the other end for something to happen, and only runs batch processes which tend to be active at different times, then swapping can be an overall win since it allows whichever process is active to make nearly full use of the available RAM. Net/net, over time, this will give greater overall throughput for the batch processes on the machine.

But, remember that the server running your web-site is an interactive process; if your server processes (web/app server, database, search server, etc.) are stuck swapping, your site has for all intents and purposes become unusable to your users.


This is a fixable problem

Most processes have known data structures that consume substantial RAM, and in many cases these processes could easily discard and later regenerate their data structures in much less time than even a single page fault. Caches can simply be pruned or discarded since they will self-regenerate over time.

These data structures should never be swapped out, since regeneration is far cheaper. Somehow the OS should ask each RAM-intensive and least-recently-accessed process to discard its data structures to free up RAM, instead of swapping out the pages occupied by the data structure. Of course, this would require a tighter interaction between the OS and processes than exists today; Java's SoftReference is close, except this only works within a single JVM, and does not interact with the OS.
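
Java's SoftReference is worth a small illustration here (a sketch; buildExpensiveStructure is a hypothetical stand-in for whatever your application regenerates, and java.lang.ref.SoftReference is the class involved):

  // The GC may clear this reference under memory pressure, so the data
  // structure gets discarded and later regenerated, rather than swapped out.
  SoftReference<int[]> ref = new SoftReference<int[]>(buildExpensiveStructure());

  int[] data = ref.get();
  if (data == null) {
    data = buildExpensiveStructure();   // regenerate on demand
    ref = new SoftReference<int[]>(data);
  }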


What can you do?

Until this problem is solved for real, the simplest workaround is to disable swapping entirely, and stuff as much RAM as you can into the machine. RAM is cheap, memory modules are dense, and modern motherboards accept many modules. This is what I do.

Of course, with this approach, when you run out of RAM stuff will start failing. If the software is well written, it'll fail gracefully: your browser will tell you it cannot open a new window or visit a new page. If it's poorly written it will simply crash, thus quickly freeing up RAM and hopefully not losing any data or corrupting any files in the process. Linux takes the simple draconian approach of picking a memory hogging process and SIGKILL'ing it.

If you don't want to disable swapping you should at least tell the OS not to swap pages out for IO caching.

Just say no to swapping!

Thursday, March 31, 2011

A login-wall is nearly as bad as a pay-wall!

Much has been said and asked about the differences between Stack Overflow and Quora.

And, while there are deep and interesting differences, such as how Stack Overflow makes reputation tracking and badges explicit, in my opinion, one simple difference is the most important of all: Quora's login-wall.

See, you cannot do anything with Quora until you've registered, while with Stack Overflow you can do almost everything without registering. They are polar opposites!

Like everyone else, I have too much curiosity and too little time. I try to keep up on Hacker News (sorry Digg and Reddit): I click through to the cool stuff, and then move on. You have one precious first page impression to rope me in, so don't spend that impression with a login-wall!

I mean, sure, I'm still going to go link up my Facebook account so I can login to Quora and see the questions, answers, conversations. (And, yes, Facebook seems to be winning at the "universal ID" game, even though I like OpenID better.) Still, for each persistent user like me, you've lost 9 non-persistent ones with that dreaded login-wall.

Remember: if you are a new cool Web site, gaining value from the network effect (as all social sites do), trying to eke out just a tiny slice of all these fickle users jumping around out there, don't put up a login-wall! It's just about as bad as a pay-wall. Let brand new users do as much as possible with your site, and make that very first page impression count.

Saturday, March 26, 2011

Your test cases should sometimes fail!

I'm an avid subscriber of the delightful weekly (sometimes) Python-URL! email, highlighting the past week's interesting discussions across the numerous Python lists. Each summary starts with the best quote from the week; here's last week's quote:
"So far as I know, that actually just means that the test suite is insufficient." - Peter Seebach, when an application passes all its tests.
I wholeheartedly agree: if your build always passes its tests, that means your tests are not tough enough! Ideally the tests should stay ahead of the software, constantly pulling you forwards to improve its quality. If the tests keep passing, write new ones that fail! Or make existing ones evil-er.

You'll be glad to know that Lucene/Solr's tests do sometimes fail, as you can see in the Jenkins (formerly Hudson) automated trunk builds.


Randomized testing

Our test infrastructure has gotten much better, just over the past 6 months or so, through heavy use of randomization.

When a test needs a Directory instance, but doesn't care which, it uses the newDirectory method. This method picks one of Lucene's Directory implementations (RAMDirectory, NIOFSDirectory, MMapDirectory, etc.) and then wraps it with MockDirectoryWrapper, a nice little class that does all sorts of fun things like: occasionally calling Thread.yield; preventing still-open files from being overwritten or deleted (acts-like-Windows); refusing to write to the same file twice (verifying Lucene is in fact write-once); breaking up a single writeBytes into multiple calls; optionally throwing IOException on disk full, or simply throwing exceptions at random times; simulating an OS/hardware crash by randomly corrupting un-sync'd files in devilish ways; etc. We also randomly pick a timezone and locale for each test.

To randomize indexing, we create an IndexWriterConfig, tweaking all sorts of settings, and use RandomIndexWriter (like IndexWriter, except it sometimes optimizes, commits, yields, etc.). The newField method randomly enables or disables stored fields and term vectors. We create random codecs, per field, by combining a terms dictionary with a random terms index and postings implementations. MockAnalyzer injects payloads into its tokens.
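
Put together, a typical randomized test looks roughly like this (a sketch only; the exact signatures of newField, RandomIndexWriter, and friends have shifted between versions, so treat the details as illustrative rather than exact):

  public class TestSomething extends LuceneTestCase {
    public void testBasic() throws Exception {
      Directory dir = newDirectory();        // random Directory impl, wrapped in MockDirectoryWrapper
      RandomIndexWriter w = new RandomIndexWriter(random, dir);
      Document doc = new Document();
      doc.add(newField("body", "some test text",
                       Field.Store.NO, Field.Index.ANALYZED));
      w.addDocument(doc);
      IndexReader r = w.getReader();
      w.close();
      IndexSearcher s = newSearcher(r);      // may or may not wrap an ExecutorService
      assertEquals(1, s.search(new TermQuery(new Term("body", "text")), 10).totalHits);
      s.close();
      r.close();
      dir.close();
    }
  }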

Sometimes we use the PreFlex codec, to write all indices in the 3.x format (so that we test index backwards compatibility), and sometimes the nifty SimpleText codec. We have exotic methods for creating random yet somewhat realistic full Unicode strings. When creating an IndexSearcher, we might use threads (pass an ExecutorService), or not. We catch tests that leave threads running, or that cause insanity in the FieldCache (for example by loading both parent and sub readers).


Reproducibility

To ensure a failure is reproducible, we save the random seeds and on a failure print out a nice line like this:
NOTE: reproduce with: ant test -Dtestcase=TestFieldCacheTermsFilter -Dtestmethod=testMissingTerms -Dtests.seed=-1046382732738729184:5855929314778232889
This fixes the seed so that the test runs deterministically. Sometimes, horribly, we have bugs in this seed logic, thus causing tests to not run deterministically and we scramble to fix those bugs first!

If you happen to hit a test failure, please send that precious line to the dev list! This is like the Search for Extraterrestrial Intelligence (SETI): there are some number of random seeds out there (hopefully, not too many!), that will lead to a failure, and if your computer is lucky enough to discover one of these golden seeds, please share the discovery!

The merging of Lucene and Solr's development was also a big step forward for test coverage, since every change in Lucene is now tested against all of Solr's test cases as well.

Tests accept a multiplier to crank things up, causing them to use more test documents or iterations, run for longer time, etc. We now have perpetual jobs on Jenkins, for both 3.x and trunk, launching every 15 minutes with multiplier 5. We know quickly when someone breaks the build!

This added test coverage has already caught a number of sneaky bugs (including a rare index corruption case on disk-full and a chunking bug in MMapDirectory) that we otherwise would not have discovered for some time.

The test infrastructure itself is so useful that it's now been factored out as a standalone JAR so apps using Lucene can tap into it to create their own fun randomized tests.

Thursday, March 24, 2011

Lucene's FuzzyQuery is 100 times faster in 4.0

There are many exciting improvements in Lucene's eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only from its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be.

FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.
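
Programmatically, that first query is roughly the following (a sketch using the 4.0-style integer max-edit-distance constructor; the field name is made up):

  // Matches terms within 1 edit of "mcandless" in the "author" field:
  Query q = new FuzzyQuery(new Term("author", "mcandless"), 1);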

Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.


The journey begins

The long journey began when Robert Muir had the idea of pre-building a Levenshtein Automaton, a deterministic automaton (DFA) that accepts only the terms within edit distance N. Doing this, up front, and then intersecting that automaton with the terms in the index, should give a massive speedup, he reasoned.

At first he built a simple prototype, explicitly unioning the separate DFAs that allow for up to N insertions, deletions and substitutions. But, unfortunately, just building that DFA (let alone then intersecting it with the terms in the index), was too slow.

Fortunately, after some Googling, he discovered a paper by Klaus Schulz and Stoyan Mihov (now famous among the Lucene/Solr committers!) detailing an efficient algorithm for building the Levenshtein Automaton from a given base term and max edit distance. All he had to do was code it up! It's just software, after all. Somehow, he roped Mark Miller, another Lucene/Solr committer, into helping him do this.

Unfortunately, the paper was nearly unintelligible! It's 67 pages, filled with all sorts of equations, Greek symbols, definitions, propositions, lemmas, proofs. It uses scary concepts like Subsumption Triangles, along with beautiful yet still unintelligible diagrams. Really the paper may as well have been written in Latin.

Much coffee and beer was consumed, sometimes simultaneously. Many hours were spent on IRC, staying up all night, with Mark and Robert carrying on long conversations, which none of the rest of us could understand, trying desperately to decode the paper and turn it into Java code. Weeks went by like this and they actually had made some good initial progress, managing to loosely crack the paper to the point where they had a test implementation of the N=1 case, and it seemed to work. But generalizing that to the N=2 case was... daunting.


The breakthrough

Then, finally, a breakthrough! Robert found, after even more Googling, an existence proof, in an unexpected place: an open-source package, Moman, under the generous MIT license. The author, Jean-Phillipe Barrette-LaPierre, had somehow, incredibly, magically, quietly, implemented the algorithm from this paper. And this was apparently a random side project for him, unrelated to his day job. So now we knew it was possible (and we all have deep admiration for Jean-Phillipe!).

We decided to simply re-use Moman's implementation to accomplish our goals. But, it turns out, its source code is all Python (my favorite programming language)! And, nearly as hairy as the paper itself. Nevertheless, we pushed on.

Not really understanding the Python code, and also neither the paper, we desperately tried to write our own Python code to tap into the various functions embedded in Moman's code, to auto-generate Java code containing the necessary tables for each max edit distance case (N=1, N=2, etc.). We had to guess what each Python function did, by its name, trying to roughly match this up to the spooky terminology in the paper.

The result was createLevAutomata.py: it auto-generates crazy looking Java code (see Lev2ParametricDescription.java, and scroll to the cryptic packed tables at the bottom), which in turn is used by further Java code to create the Levenshtein automaton per-query. We only generate the N=1 and N=2 cases (the N>=3 cases aren't really practical, at least not yet).


The last bug...

Realize, now, what a crazy position we were in. We wrote our own scary Python code, tapping into various functions in the Moman package, to auto-generate unreadable Java code with big tables of numbers, which is then used to generate Levenshtein automata from the base term and N. We went through many iterations with this crazy chain of Python and Java code that we barely understood, slowly iterating to get the bugs out.

After fixing many problems, we still had one persistent bug which we just couldn't understand, let alone fix. We struggled for several days, assuming the bug was in our crazy Python/Java chain. Finally, we considered the possibility that the bug was in Moman, and indeed Robert managed to reduce the problem to a tiny Python-only case showing where Moman failed to match the right terms. Robert sent this example to Jean-Phillipe, who quickly confirmed the bug and posted a patch the next day. We applied his patch and suddenly everything was working perfectly!

Fortunately, while this fast FuzzyQuery was unbelievably hairy to implement, testing it well is relatively easy since we can validate it against the brute-force enumeration from 3.0. We have several tests verifying the different layers executed by the full FuzzyQuery. The tests are exhaustive in that they test all structurally different cases possible in the Levenshtein construction, using terms drawn from a binary alphabet (only the characters 0 and 1).

Beyond just solving this nearly impossible task of efficiently compiling a term to a Levenshtein Automaton, we had many other parts to fill in. For example, Robert separately created a general AutomatonQuery, re-using infrastructure from the open-source Brics automaton package, to enable fast intersection of an automaton against all terms and documents in the index. This query is now used to handle WildcardQuery, RegexpQuery, and FuzzyQuery. It's also useful for custom cases, too; for example it's used by Solr to reverse wildcard queries. These slides from Robert describe AutomatonQuery, and its fun possible use case, in more detail.

Separately, we had an impedance mismatch: these automatons speak full Unicode (UTF32) characters, yet Lucene's terms are stored as UTF8 bytes, so we had to create a UTF32 -> UTF8 automaton converter, which by itself was also very hairy! That converter translates any UTF32 automaton into an equivalent UTF8 automaton, which can be directly intersected against the terms in the index.

So, today, when you run a FuzzyQuery in 4.0, it efficiently seeks and scans only those regions of the term space which may have matches, guided by the Levenshtein automaton. This, coupled with ongoing performance improvements to seeking and scanning terms, as well as other major improvements like performing MultiTermQuery rewrites per-segment, has given us the astounding overall gains in FuzzyQuery.

Thanks to these enormous performance improvements, Robert has created an entirely new automaton spell checker that uses this same algorithm to find candidate terms for respelling. This is just like FuzzyQuery, except it doesn't visit the matching documents. This is a big improvement over the existing spellchecker as it does not require a separate spellchecker index be maintained.

This whole exciting experience is a great example of why open-source development works so well. Here we have diverse committers from Lucene/Solr, bringing together their various unusual strengths (automatons, Unicode, Python, etc.) to bear on an insanely hard challenge, leveraging other potent open-source packages including Moman and Brics, iterating with the authors of these packages to resolve bugs. No single person involved in this really understands all of the parts; it's truly a team effort.

And now you know what's going on under the hood when you see incredible speedups with FuzzyQuery in 4.0!

[For the not-faint-of-heart, you can browse LUCENE-1606 to see parts of this story unfolding through Jira]

Friday, March 11, 2011

Hack on Lucene this summer!

Are you a student? Looking to do some fun coding this summer? Then join us for the 2011 Google Summer of Code!

The application deadline is in less than a month! Lucene has these initial potential projects identified, but you can also pick your own; just be sure to discuss with the community first (send an email to dev@lucene.apache.org).