Saturday, May 21, 2011

The invisible Lucene bug fixed point

It turns out the Jira issue tracking system, which we make heavy use of here at Apache, uses Lucene under the hood for searching and browsing issues. This is wonderful, since it means Lucene developers are eating their own dog food whenever they use Jira.

Atlassian has opened up some doozy bugs over time, including LUCENE-140, one of the earliest issues I ever worked on. They sent me a t-shirt for fixing that one (thank you!).

Now, imagine this: what if there were a sneaky bug in Lucene, say a certain text fragment that causes an exception during indexing? A user opens an issue to report it, including the problematic text fragment. Yet, because Jira uses Lucene, Jira hits that very exception while indexing the fragment, leaving this one bug un-searchable and un-viewable when browsing! An invisible bug fixed point.

It's somewhat mind-bending to think about, Lucene recursing on itself through Jira, yet it's theoretically possible! Maybe we have a few invisible bug fixed points lurking already and nobody knows...

Saturday, May 7, 2011

265% indexing speedup with Lucene's concurrent flushing

A week ago, I described the nightly benchmarks we use to catch any unexpected slowdowns in Lucene's performance. Back then the graphs were rather boring (a good thing), but not anymore! Have a look at the stunning jumps in Lucene's indexing rate:



[Chart: Lucene's nightly indexing throughput, in GB of plain text per hour, with annotated dates A, B, C and D.]

(Click through the image to see details about what changed on dates A, B, C and D.)

Previously we were around 102 GB of plain text per hour, and now it's about 270 GB/hour. That's 2.65X the old rate! Lucene now indexes all of Wikipedia's 23.2 GB (English) export in 5 minutes and 10 seconds (23.2 GB at 270 GB/hour works out to just over 5 minutes).

How did this happen? Concurrent flushing.

That new feature lived on a branch for quite some time, undergoing many fun iterations, and was finally merged back to trunk about a week ago.

Before concurrent flushing, whenever IndexWriter needed to flush a new segment, it would stop all indexing threads and hijack one of them to perform the rather compute-intensive flush. This was a nasty bottleneck on highly concurrent hardware: no matter how many cores were available, flushing was inherently single-threaded. I previously described the problem here.

But with concurrent flushing, each thread freely flushes its own segment even while other threads continue indexing. No more bottleneck!
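To make that concrete, here is a minimal sketch of multi-threaded indexing against the Lucene 4.x-era API. This is not the actual nightly benchmark code: the class, the settings, and the content source (nextDocText) are placeholders of my own.

    // A minimal sketch, not the real benchmark: several application threads
    // feed a single IndexWriter. With concurrent flushing, a thread that
    // trips a flush writes its own segment to disk while the other threads
    // keep indexing.
    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ConcurrentIndexing {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("index"));
        IndexWriterConfig iwc = new IndexWriterConfig(
            Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        iwc.setRAMBufferSizeMB(350.0);      // the post-change benchmark setting
        final IndexWriter writer = new IndexWriter(dir, iwc);

        Thread[] threads = new Thread[20];  // 20 indexing threads, as in the benchmark
        for (int i = 0; i < threads.length; i++) {
          threads[i] = new Thread(new Runnable() {
            public void run() {
              try {
                for (int d = 0; d < 100000; d++) {
                  Document doc = new Document();
                  doc.add(new TextField("body", nextDocText(), Field.Store.NO));
                  // IndexWriter is thread-safe; no external locking is needed,
                  // and a flush in this thread no longer stalls the others.
                  writer.addDocument(doc);
                }
              } catch (Exception e) {
                throw new RuntimeException(e);
              }
            }
          });
          threads[i].start();
        }
        for (Thread t : threads) {
          t.join();
        }
        writer.close();
      }

      // Placeholder for whatever content source supplies the documents.
      static String nextDocText() {
        return "some plain text to index";
      }
    }

Note there is nothing new to call here: the speedup comes entirely from inside IndexWriter, not from any new API.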

Note that there are two separate jumps in the graph. The first jump, the day concurrent flushing landed (labelled as B on the graph), shows the improvement while still using only 6 indexing threads and a 512 MB RAM buffer; those settings had given the fastest indexing rate before concurrent flushing.

The second jump (labelled as D on the graph) happened when I increased the indexing threads to 20 and dropped the RAM buffer to 350 MB, giving the fastest indexing rate after concurrent flushing.

One nice side effect of concurrent flushing is that you can now use RAM buffers well over 2.1 GB, as long as you use multiple threads. Curiously, I found that larger RAM buffers slow down the overall indexing rate. This might be because of the discontinuity when closing IndexWriter, when we must wait for all the RAM buffers to be written to disk. It would be better to measure the steady-state indexing rate, indexing an effectively infinite content source and ignoring the startup and ending transients; I suspect we'd then see gains from larger RAM buffers, but that's just speculation at this point.
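As an illustration, a configuration like this (a hedged sketch reusing the imports from the example above; the exact number is arbitrary) relies on multiple threads sharing the buffer:

    // Hedged sketch: the RAM buffer size is now a flush trigger shared
    // across the per-thread in-memory segments, so a total well beyond the
    // old ~2 GB single-buffer ceiling is workable when several threads are
    // indexing at once.
    IndexWriterConfig iwc = new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    iwc.setRAMBufferSizeMB(4096.0);  // well over 2.1 GB, spread across threads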

There were some very challenging changes required to make concurrent flushing work, especially around how IndexWriter handles buffered deletes. Simon Willnauer does a great job describing these changes here and here. Concurrency is tricky!
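For context, here is an illustrative fragment (reusing the setup above, plus Term and StringField) showing the buffered-delete APIs whose semantics had to be preserved once flushing went concurrent:

    // Both calls are buffered in RAM; under concurrent flushing those
    // buffered deletes must be resolved correctly against each thread's
    // private in-memory segment as it flushes.
    writer.deleteDocuments(new Term("category", "spam"));  // buffered delete by term
    Document newDoc = new Document();
    newDoc.add(new StringField("id", "42", Field.Store.YES));
    writer.updateDocument(new Term("id", "42"), newDoc);   // atomic delete-then-add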

Remember, this change only helps you if you have concurrent hardware, you use enough threads for indexing, and there's no other bottleneck (for example, in the content source that provides the documents). Also, if your I/O system can't keep up, it will bottleneck your CPU concurrency. The nightly benchmark runs on a computer with 12 real cores (24 with hyperthreading) and a fast solid-state disk (an OCZ Vertex 3). Finally, this feature is not yet released: it was committed to Lucene's trunk, which will eventually be released as 4.0.