If you've ever wondered how Lucene picks segments to merge during indexing, it looks something like this:
That video displays segment merges while indexing the entire Wikipedia (English) export (29 GB plain text), played back at ~8X real-time.
Each segment is a bar, whose height is the size (in MB) of the segment (log-scale). Segments on the left are largest; as new segments are flushed, they appear on the right. Segments being merged are colored the same color and, once the merge finishes, are removed and replaced with the new (larger) segment. You can see the nice logarithmic staircase pattern that merging creates.
By default, using ConcurrentMergeScheduler, Lucene executes each merge in a separate thread, allowing multiple merges to run at once without blocking ongoing indexing. The bigger the merge the longer it takes to finish.
One simple metric you can use to measure overall merge cost is to divide the total number of bytes read/written for all merging by the final byte size of an index; smaller values are better. This is analogous to the write amplification measure that solid-state disks use, in that your app has written X MB but because of merging and deleted documents overhead, Lucene had to internally read and write some multiple of X. You can think of this write amplification as a tax on your indexing; you don't pay this tax up front, when the document is first indexed, but only later as you continue to add documents to the index. The video shows the total size of the index as well as net bytes merged, so it's easy to compute write amplification for the above run: 6.19 (final index size was 10.87 GB and net bytes copied during merging was 67.30 GB).
Proper merge selection is actually a tricky problem, in general, because we must carefully balance not burning CPU/IO (due to inefficient merge choices), while also not allowing too many segments to accumulate in the index, as this slows down search performance. To minimize merge cost, you ideally would merge only equal-sized segments, and merge a larger number of segments at a time.
In fact, from the viewpoint of the MergePolicy, this is really a game against a sneaky opponent who randomly makes sudden changes to the index, such as flushing new segments or applying new deletions. If the opponent is well behaved, it'll add equal sized, large segments, which are easy to merge well, as was the case in the above video; but that's a really easy game, like playing tic-tack-toe against a 3 year old.
This opponent is more like playing a game of chess:
No more nice looking staircase! This test shows the more challenging near-real-time use case, which calls updateDocument (= delete + add) at a high rate and frequently opens a new reader (creating a new segment each time). The dark gray band on top of each segment shows the proportion of deletions in that segment. When you delete a document in Lucene, the bytes consumed by that document are not reclaimed until the segment is merged, and you can see old segments being eroded as new segments are appended to the index. Unfortunately, Lucene's current default LogByteSizeMergePolicy struggles to pick good merges against this opponent, often merging irregularly sized segments.
The big issue with LogByteSizeMergePolicy is that it must pick adjacent segments for merging. However, we recently relaxed this longstanding limitation, and I'm working on a new merge policy, TieredMergePolicy (currently a patch on LUCENE-854) to take advantage of this. TieredMergePolicy also fixes some other limitations of LogByteSizeMergePolicy, such as merge cascading that results in occasionally "inadvertently optimizing" the index as well as the overly coarse control it offers over the maximum segment size.
TieredMergePolicy first computes the allowed "budget" of how many segments should be in the index, by counting how many steps the "perfect logarithmic staircase" would require given total index size, minimum segment size (floored), mergeAtOnce, and a new configuration maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be.
Whenever the index is over-budget, it selects the best merge. Potential merges are scored with a combination of skew (basically how "equally sized" the segments being merged are), total size (smaller merges are favored), and how many deleted documents will be reclaimed. It also tries to merge to the exact maximum segment size (default 5GB).
Here's the same difficult near-real-time test, this time using TieredMergePolicy instead:
Note how the segments are now sorted by size, since TieredMergePolicy is allowed to merge non-adjacent segments. For the above difficult run, the write amplification for Lucene's current default merge policy (LogByteSizeMergePolicy) was 14.49 while the new merge policy (TieredMergePolicy) was 13.64, a nice improvement, though not as much as I was expecting. I suspect this is because TieredMergePolicy works hard to hit the max segment size (5 GB), resulting in 6 maximum sized segments while LogByteSizeMergePolicy had only 3. These numbers are much higher than the 6.19 write amplification from the "easy" merging, since that merging was about as efficient as we can hope for.
While TieredMergePolicy is a good improvement over LogByteSizeMergePolicy, it's still theoretically possible to do even better! In particular, TieredMergePolicy is greedy in its decision making: it only looks statically at the index, as it exists right now, and always chooses what looks like the best merge, not taking into account how this merge will affect future merges nor what further changes the opponent is likely to make to the index. This is good, but it's not guaranteed to produce the optimal merge sequence. For any series of changes made by the opponent there is necessarily a corresponding perfect sequence of merges, that minimizes net merge cost while obeying the budget. If instead the merge policy used a search with some lookahead, such as the Minimax algorithm, it could do a better job setting up more efficient future merges. I suspect this theoretical gain is likely small in practice; but if there are any game theorists out there reading this now, I'd love to be proven wrong!
I generated these movies with this simple Python script. It parses the infoStream output from IndexWriter, renders one frame at a time, saved as a PNG file in the local file system, using the Python Imaging Library, and finally encodes all frames into a video using MEncoder with the X264 codec.