Tuesday, June 14, 2011

Near-real-time latency during large merges

I looked into the curious issue I described in my last post, where the NRT reopen delays can become "spikey" (take longer) during a large merge.

To show the issue, I modified the NRT test to kick off a background optimize on startup. This runs a single large merge, creating a 13 GB segment, and indeed produces spikey reopen delays (purple):

The large merge finishes shortly after 7 minutes, after which the reopen delays become healthy again. Search performance (green) is unaffected.

I also added Linux's dirty bytes to the graph, as reported by /proc/meminfo; it's the saw-tooth blue/green series on the bottom. Note that it's divided by 10, to better fit the Y axis; the peaks are around 800-900 MB.
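
For anyone who wants to collect the same series, here is a minimal Java sketch that samples the Dirty line from /proc/meminfo once per second. The class and method names are hypothetical, it assumes the usual "Dirty: <n> kB" line format, and it is not the script used to produce the graph in this post.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

class DirtyBytes {

  /** Returns the "Dirty" value in bytes from meminfo-style lines, or -1 if no such line exists. */
  static long parseDirtyBytes(List<String> lines) {
    for (String line : lines) {
      if (line.startsWith("Dirty:")) {
        // The line looks like: "Dirty:            123456 kB"
        String[] parts = line.substring("Dirty:".length()).trim().split("\\s+");
        return Long.parseLong(parts[0]) * 1024L;
      }
    }
    return -1;
  }

  public static void main(String[] args) throws IOException, InterruptedException {
    // Take 60 samples, one per second (Linux only, since it reads /proc/meminfo).
    for (int i = 0; i < 60; i++) {
      long dirty = parseDirtyBytes(Files.readAllLines(Paths.get("/proc/meminfo")));
      System.out.println(System.currentTimeMillis() + " " + (dirty / (1024 * 1024)) + " MB dirty");
      Thread.sleep(1000);
    }
  }
}
```

Plotting the second column against the first should show the same kind of saw-tooth while a large merge is running.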

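The large merge writes bytes at a fairly high rate (around 30 MB/sec), but Linux buffers those writes in RAM, only actually flushing them to disk every 30 seconds; this is what produces the saw-tooth pattern. At 30 MB/sec, 30 seconds of buffered writes comes to roughly 900 MB, which matches the peaks in the graph.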

From the graph you can see that the spikey reopen delays generally correlate with the times when Linux is flushing the dirty pages to disk. Apparently, this heavy write IO interferes with the read IO required when resolving deleted terms to document IDs. To confirm this, I ran the same stress test, but with only adds (no deletions); the reopen delays were then unaffected by the ongoing large merge.

So the mystery is finally explained, but how to fix it?

I know I could tune Linux's IO, for example to write more frequently, but I'd rather find a Lucene-only solution since we can't expect most users to tune the OS.

One possibility is to make a RAM-resident terms dictionary, just for primary-key fields. This could be very compact, for example by using an FST, and should give lookups that never hit disk unless the OS has frustratingly swapped out your RAM data structures. This could also be separately useful for applications that need fast document lookup by primary key, so someone should at some point build this.

Another, lower-level idea is to simply rate limit the bytes/sec written by merges. Since big merges also impact ongoing searches, likely we could help that case as well. To try this out, I made a simple prototype (see LUCENE-3202), and then re-ran the same stress test, limiting all merging to 10 MB/sec:

The optimize now took 3 times longer, and the peak dirty bytes (around 300 MB) is 1/3rd as large, as expected since the IO write rate is limited to 10 MB/sec. But look at the reopen delays: they are now much better contained, averaging around 70 milliseconds while the optimize is running, and dropping to 60 milliseconds once the optimize finishes. I think the ability to limit merging IO is an important feature for Lucene!
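
To make the rate-limiting idea concrete, here is a minimal sketch of how a bytes/sec cap on merge writes could work. This is not the LUCENE-3202 patch: the class name, the method names and the choice to pause the writer before each chunk are all assumptions made for illustration.

```java
class SimpleWriteRateLimiter {

  private final double nanosPerByte;
  // Earliest time (in System.nanoTime() units) at which the next write may start.
  private long nextAllowedNanos = Long.MIN_VALUE;

  SimpleWriteRateLimiter(double mbPerSec) {
    this.nanosPerByte = 1_000_000_000.0 / (mbPerSec * 1024 * 1024);
  }

  /**
   * Reserves time budget for writing `bytes` and returns how many nanoseconds
   * the caller must wait before writing, given the current time.
   */
  synchronized long reserve(long bytes, long nowNanos) {
    long start = Math.max(nowNanos, nextAllowedNanos);
    nextAllowedNanos = start + Math.round(bytes * nanosPerByte);
    return start - nowNanos;
  }

  /** Call this before writing `bytes`; sleeps just long enough to stay under the cap. */
  void pause(long bytes) throws InterruptedException {
    long waitNanos = reserve(bytes, System.nanoTime());
    if (waitNanos > 0) {
      Thread.sleep(waitNanos / 1_000_000, (int) (waitNanos % 1_000_000));
    }
  }
}
```

In this sketch, the code writing the merged segment would call pause(n) before flushing each n-byte buffer. At a 10 MB/sec cap, a 13 GB merge needs a little over 20 minutes, which is in line with the optimize taking about 3 times longer than before.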


  1. Are you certain that bytes/second is the issue and not IOPS/sec?

  2. No, I'm not certain: I haven't separately tested IOPS decoupled from bytes/sec. This would actually be fairly easy to do, by changing Lucene's read/write buffer size for the IndexInputs/Outputs we open during merging.