Friday, February 11, 2011

Visualizing Lucene's segment merges

If you've ever wondered how Lucene picks segments to merge during indexing, it looks something like this:



That video displays segment merges while indexing the entire Wikipedia (English) export (29 GB plain text), played back at ~8X real-time.

Each segment is a bar, whose height is the size (in MB) of the segment (log-scale). Segments on the left are largest; as new segments are flushed, they appear on the right. Segments being merged are colored the same color and, once the merge finishes, are removed and replaced with the new (larger) segment. You can see the nice logarithmic staircase pattern that merging creates.

By default, using ConcurrentMergeScheduler, Lucene executes each merge in a separate thread, allowing multiple merges to run at once without blocking ongoing indexing. The bigger the merge, the longer it takes to finish.
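
Concretely, wiring this up looks something like the following sketch against the trunk (4.0-era) APIs; the class name and the thread cap of 3 are only illustrative, not recommendations:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeSchedulerSetup {
    public static IndexWriter openWriter(File indexPath) throws Exception {
        // ConcurrentMergeScheduler is already the default; configuring it
        // explicitly here just shows where the knob lives.
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
        cms.setMaxThreadCount(3);  // at most 3 merges running concurrently

        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
                new StandardAnalyzer(Version.LUCENE_40));
        iwc.setMergeScheduler(cms);
        return new IndexWriter(FSDirectory.open(indexPath), iwc);
    }
}
```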

One simple metric you can use to measure overall merge cost is to divide the total number of bytes read/written for all merging by the final byte size of an index; smaller values are better. This is analogous to the write amplification measure that solid-state disks use, in that your app has written X MB but because of merging and deleted documents overhead, Lucene had to internally read and write some multiple of X. You can think of this write amplification as a tax on your indexing; you don't pay this tax up front, when the document is first indexed, but only later as you continue to add documents to the index. The video shows the total size of the index as well as net bytes merged, so it's easy to compute write amplification for the above run: 6.19 (final index size was 10.87 GB and net bytes copied during merging was 67.30 GB).
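
The computation itself is trivial; here's a tiny sketch (the class and method names are mine, just for illustration), plugging in the numbers from this run:

```java
public class WriteAmplification {
    // Write amplification = net bytes copied during merging
    // divided by the final index size (smaller is better).
    public static double writeAmplification(double netMergedBytes, double finalIndexBytes) {
        return netMergedBytes / finalIndexBytes;
    }

    public static void main(String[] args) {
        // Numbers from the "easy" run above: 67.30 GB merged, 10.87 GB final index
        System.out.println(writeAmplification(67.30, 10.87));  // ≈ 6.19
    }
}
```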

Proper merge selection is actually a tricky problem, in general, because we must carefully balance not burning CPU/IO (due to inefficient merge choices), while also not allowing too many segments to accumulate in the index, as this slows down search performance. To minimize merge cost, you ideally would merge only equal-sized segments, and merge a larger number of segments at a time.

In fact, from the viewpoint of the MergePolicy, this is really a game against a sneaky opponent who randomly makes sudden changes to the index, such as flushing new segments or applying new deletions. If the opponent is well behaved, it'll add equal-sized, large segments, which are easy to merge well, as was the case in the above video; but that's a really easy game, like playing tic-tac-toe against a 3 year old.

A tougher opponent makes the game more like chess:



No more nice looking staircase! This test shows the more challenging near-real-time use case, which calls updateDocument (= delete + add) at a high rate and frequently opens a new reader (creating a new segment each time). The dark gray band on top of each segment shows the proportion of deletions in that segment. When you delete a document in Lucene, the bytes consumed by that document are not reclaimed until the segment is merged, and you can see old segments being eroded as new segments are appended to the index. Unfortunately, Lucene's current default LogByteSizeMergePolicy struggles to pick good merges against this opponent, often merging irregularly sized segments.

The big issue with LogByteSizeMergePolicy is that it must pick adjacent segments for merging. However, we recently relaxed this longstanding limitation, and I'm working on a new merge policy, TieredMergePolicy (currently a patch on LUCENE-854), to take advantage of this. TieredMergePolicy also fixes some other limitations of LogByteSizeMergePolicy, such as the merge cascading that occasionally results in "inadvertently optimizing" the index, as well as the overly coarse control it offers over the maximum segment size.

TieredMergePolicy first computes the allowed "budget" of how many segments should be in the index, by counting how many steps the "perfect logarithmic staircase" would require given total index size, minimum segment size (floored), mergeAtOnce, and a new configuration maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be.
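
Here's a rough sketch of that budget computation. This is not the actual TieredMergePolicy code, just the idea: walk up the idealized staircase, allowing maxSegmentsPerTier segments per level, with each level's segments mergeAtOnce times larger than the last (sizes in MB; names are illustrative):

```java
public class MergeBudget {
    // Rough sketch (not the exact Lucene code) of how many segments
    // the "perfect logarithmic staircase" allows for a given index size.
    public static long allowedSegmentCount(long totalMB, long floorMB,
                                           int mergeAtOnce, int maxSegmentsPerTier) {
        long levelSizeMB = floorMB;   // smallest (floored) segment size
        long bytesLeftMB = totalMB;
        long allowed = 0;
        while (true) {
            long segsAtThisLevel = bytesLeftMB / levelSizeMB;
            if (segsAtThisLevel <= maxSegmentsPerTier) {
                // Largest tier: whatever remains fits on this step
                allowed += Math.ceil((double) bytesLeftMB / levelSizeMB);
                break;
            }
            // A full step of the staircase at this segment size...
            allowed += maxSegmentsPerTier;
            bytesLeftMB -= (long) maxSegmentsPerTier * levelSizeMB;
            // ...then the next step holds segments mergeAtOnce times larger
            levelSizeMB *= mergeAtOnce;
        }
        return allowed;
    }
}
```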

Whenever the index is over-budget, it selects the best merge. Potential merges are scored with a combination of skew (basically how "equally sized" the segments being merged are), total size (smaller merges are favored), and how many deleted documents will be reclaimed. It also tries to merge to the exact maximum segment size (default 5GB).
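
A sketch of that scoring, to make the shape concrete; the exponents and weights here are illustrative stand-ins, not necessarily the exact values in the patch (lower score = better merge):

```java
public class MergeScore {
    // Hypothetical merge scoring sketch combining skew, total size and
    // reclaimed deletions; lower is better.
    public static double score(long[] segmentSizesMB, double deletedFraction) {
        long total = 0, largest = 0;
        for (long s : segmentSizesMB) {
            total += s;
            largest = Math.max(largest, s);
        }
        double skew = (double) largest / total;      // 1/N for perfectly equal segments
        double sizeFactor = Math.pow(total, 0.05);   // mildly penalize bigger merges
        double reclaim = Math.pow(1.0 - deletedFraction, 2.0);  // reward reclaiming deletes
        return skew * sizeFactor * reclaim;
    }
}
```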

Here's the same difficult near-real-time test, this time using TieredMergePolicy instead:



Note how the segments are now sorted by size, since TieredMergePolicy is allowed to merge non-adjacent segments. For the above difficult run, the write amplification for Lucene's current default merge policy (LogByteSizeMergePolicy) was 14.49 while the new merge policy (TieredMergePolicy) was 13.64, a nice improvement, though not as much as I was expecting. I suspect this is because TieredMergePolicy works hard to hit the max segment size (5 GB), resulting in 6 maximum sized segments while LogByteSizeMergePolicy had only 3. These numbers are much higher than the 6.19 write amplification from the "easy" merging, since that merging was about as efficient as we can hope for.

While TieredMergePolicy is a good improvement over LogByteSizeMergePolicy, it's still theoretically possible to do even better! In particular, TieredMergePolicy is greedy in its decision making: it only looks statically at the index, as it exists right now, and always chooses what looks like the best merge, not taking into account how this merge will affect future merges nor what further changes the opponent is likely to make to the index. This is good, but it's not guaranteed to produce the optimal merge sequence. For any series of changes made by the opponent there is necessarily a corresponding perfect sequence of merges that minimizes net merge cost while obeying the budget. If instead the merge policy used a search with some lookahead, such as the Minimax algorithm, it could do a better job setting up more efficient future merges. I suspect this theoretical gain is likely small in practice; but if there are any game theorists out there reading this now, I'd love to be proven wrong!

I generated these movies with this simple Python script. It parses the infoStream output from IndexWriter, renders one frame at a time (saving each as a PNG file in the local file system, using the Python Imaging Library), and finally encodes all frames into a video using MEncoder with the x264 codec.
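
If you want to generate your own logs to feed the script, you enable infoStream on IndexWriter, something like this sketch against the 3.x-era API (the class name and log-file name are just examples):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintStream;

import org.apache.lucene.index.IndexWriter;

public class InfoStreamSetup {
    // Route IndexWriter's diagnostics (flushes, merge selection, merge
    // start/finish) to a log file the rendering script can parse.
    public static void enableInfoStream(IndexWriter writer) throws FileNotFoundException {
        writer.setInfoStream(new PrintStream(new File("iw-infostream.log")));
    }
}
```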

41 comments:

  1. What version of Lucene was used for this? I'm assuming 3.0.*? You selected an excellent use case, as I've spent much time indexing the text of wikipedia.

  2. Great post Michael! Do you mind sharing the code that produces the video?

  3. Hi seanahan,

    This was all against Lucene's trunk version (eventually 4.0), but I'd like to backport the new merge policy to 3.2.

    Wikipedia is great fun to index! And, the fact I can index the entire Wikipedia (English) export in 13 min 9 secs using Lucene is awesome (http://chbits.blogspot.com/2010/09/lucenes-indexing-is-fast.html).

  4. Hi John,

    Sure -- the code is already public (see last paragraph in my post).

  5. Hey Mike - this is awesome!
    What about the MergePolicy from Zoie?

  6. Hi Otis,

    Yeah I'm very curious about what BalancedSegmentMergePolicy looks like! But I haven't done it yet... looks like maybe John will make the video :)

  7. I would like to do it for zoie!

  8. Great article Mike.
    I'd like to run the program against the IndexWriter log output for building one of our 300 GB indexes. It'll take a bit longer than 13 minutes :) The write amplification metric should help us quantify what we intuitively guessed about reducing extra I/O by messing around with the ramBufferSize and the merge factor. I'm looking forward to trying the TieredMergePolicy on our data as well.

    Tom Burton-West www.hathitrust.org/blogs

  9. Hi Tom,

    I'd love to see the video from your index rebuild ;)

    Mike

  10. 13 min 9 secs - crazy shit man!

  11. I was intrigued by your query about the theoretical optimum performance here. To be a bit more formal, you'd have to define optimality in a measurable way, but I'm not sure what the right way is. Apparently it is preferable to merge segments of equal size, but I don't understand why. Could you quantify that in some way? The other factor you'd need to quantify would be the impact of having too many segments. Clearly, in the absence of these two constraints, the policy that minimizes I/O (for a finite run) is not to do any merging until the very end, and then merge everything. So do you have any rule of thumb about how bad it is to have too many segments? Does it get bad in a linear way?

  12. Hi sokolov,

    Great question on optimality... it's tricky.

    One way to bound the problem is to assign a budget: given X GB of indexed text, the index is allowed to have Y (= O(log(X))) segments, each segment at most size M.

    Then, optimal is that merge policy that fits within this budget as docs are added, and whose net cost (say total number of bytes read + written) is least?

    But, that bound might rightly be accused of assuming the structure of the solution.

    So, broadening it... you have to consider simultaneous searching and indexing, ie, periodically (perhaps, very frequently, in the near-real-time case) a reader will be reopened on the (growing) index and searches run with that.

    The more segments there are, the more costly the searching is; that added cost varies depending on the type of query, but for the simple case (single term) it's likely close to a relatively smallish constant per segment (ie cost to seek to that segment's postings for this term). For large segments that constant cost is dwarfed by the cost to iterate the postings, but for medium and tiny segments the cost can be highish, I suspect.

    You'd have to pick a given "run" (docs indexed at specific times and searches run at specific times), and then add up merging cost and searching cost and minimize that?

    On one extreme (no searching, only indexing), yes, you should do no merging at all. Or, if it's indexing and then searching (ie they never overlap), then do no merging, except for one giant merge before searching.

    At the other extreme (rarely adding/updating a doc and intense searching) you'd presumably want to optimize (merge all segments to 1) after each added document.

    I think?

  13. It took me a while to realize you'd replied: I thought I might get an e-mail from blogger when that happened. Then I thought to check back. In any case, I understand now that your original question assumed the budget as you described above, and something about a way to model the search cost. Something to ponder in an idle moment...

  14. Late to the party.. Nice article!

    One thing that stood out for me in all 3 videos was the relatively large proportion of deleted documents in segments that had breached the 5Gb max-mergeable-segment-size limit. I had not really thought about this, but now seeing your videos, it seems pretty obvious that it should behave so: once a segment reaches the size limit, it never recovers space lost to the deleted [updated] documents since the segment will never again be merged--unless we run a full optimize, of course.

    This suggests to me that perhaps a special type of "split" or "spill" merge might be useful. The idea is to somehow "merge" 2 (in our case) >5Gb segments (containing a large proportion of deletes) in such a way that we end up with one 5Gb segment and say a leftover 3Gb segment.

    A simple way to implement this might be to temporarily relax the limit and create an intermediate (in our example) 8Gb merged segment and then split that merged segment into 2 final 5 and 3Gb ones.

    What do you think?

  15. > A simple way to implement this might be to temporarily relax the limit and create an intermediate (in our example) 8Gb merged segment and then split that merged segment into 2 final 5 and 3Gb ones.

    On second thought a much simpler way is to just merge the 5Gb segment containing lots of deletes with an empty segment.

  16. Hi Babak,

    Actually, the default Lucene merge policies now pro-rate a segment's size by what percentage of the docs are not deleted. So if that 5 GB segment has 50% deletions it "looks like" a 2.5 GB segment and is then able to merge with another 2.5 GB segment. But you're right, we could also just merge sooner w/ empty segment; expungeDeletes will actually do this (eg if there's only 1 segment with too many deletions). Making a good merge policy is tricky!
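
    Concretely, the pro-rating is just this (a sketch with a hypothetical helper name, not the actual Lucene code):

```java
public class ProRatedSize {
    // Sketch: for merge selection, a segment's effective size is its
    // on-disk size scaled by the fraction of documents still live.
    public static double effectiveSizeMB(double sizeMB, double deletedFraction) {
        return sizeMB * (1.0 - deletedFraction);
    }
}
```

    So a 5 GB segment with 50% deletions "looks like" a 2.5 GB segment to the merge policy.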

  17. Hi Mike,
    This is Ravi. We are using Lucene 2.0 in our email archiving product for indexing/searching documents.
    We have been facing a critical problem which is affecting production on customer sites: while optimization takes place on larger indices (size > 2 GB), the indexer threads get into a blocked state, since the index writer opened for optimization is never released back for ongoing indexing. Here I am giving you the thread dump of the blocked indexer threads and the optimizer thread.
    I appreciate your help and suggestions in this regard.

  18. Hi Ravi,

    Can you bring this question to Lucene's user's list? (ie, send an email to java-user@lucene.apache.org).

  19. Definitely Mike, I will be sending out an email with full details and the thread dumps which I have.

  20. Sent an email with subject
    Indexer Threads Getting Into BLOCKED State While Optimization Taking Place On Large Indexes Of Size > 2GB

    The email has the thread dumps, index size information, etc.

  21. Hi Mike,
    I am very impressed with these videos. I am still learning Lucene, and I had a very weird problem yesterday: no matter what I did, I always got multiple files, not a compound file. By default, Lucene uses the compound file format, but I hit this weird problem even when I used setUseCompoundFile(true) explicitly... And my code is actually adapted from your book Lucene in Action, 2nd edition, and the demo code on lucene.apache.org... How can I fix this?

  22. Are you using TieredMergePolicy? Most likely it's the setNoCFSRatio that you're hitting. Ie, by default, TMP will allow very large segments (> 10% of total index size) to remain in non-compound format.

  23. Thank you very much, Mike. I solved it by setting setNoCFSRatio(1.0). I am using LogByteSizeMergePolicy; I'm just a beginner in Lucene... Anyway I feel a little bit weird: I indexed the same sample data from Lucene in Action, but when I run the "Ant Indexer" command on the Lucene in Action folder, it works fine... it comes out as a compound file... I am using Lucene 3.1.0... My code is almost the same as your code in the book... So how can your code in the book come out as a compound file, while I get multiple files? Anyway, thanks a lot...

  24. I think this is because the CFS ratio feature showed up in 3.0.3 (and 3.1.0), just after what LIA2 uses (it uses 3.0.2)....

  25. Mike,
    Do you have any advice on tuning TieredMergePolicy's parameters? Or at least refer me to such information? I sorta see what they do, but I have trouble figuring out if I should change their defaults, and by how much, to accomplish whatever goals. I know what to do with mergeFactor: higher means faster indexing but slower searches, and vice versa. Is there a parameter setting that reduces the total disk requirements during merging/optimize?

  26. Hi David,

    TMP splits mergeFactor into two params: segmentsPerTier (= how many segments are allowed per "level" in the index) and maxMergeAtOnce/Explicit (= when a merge runs it can merge at most this many segments at a time).

    Higher segmentsPerTier mean less merging during indexing but slower searching. maxMergeAtOnce/Explicit is really a function of how much RAM + IO bandwidth the machine doing the indexing has; you just have to test to see where the sweet spot is.

    TMP also has reclaimDeletesWeight: higher values will more strongly favor merging segments that have a high percentage of deleted documents.
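
    In code, setting these knobs looks something like this sketch (illustrative values, not recommendations):

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class TieredMergePolicySetup {
    public static void configure(IndexWriterConfig iwc) {
        TieredMergePolicy tmp = new TieredMergePolicy();
        tmp.setSegmentsPerTier(10.0);         // segments allowed per "level"
        tmp.setMaxMergeAtOnce(10);            // width of natural merges
        tmp.setMaxMergeAtOnceExplicit(30);    // width of forced (optimize/expungeDeletes) merges
        tmp.setMaxMergedSegmentMB(5 * 1024);  // ~5 GB cap on merged segments
        tmp.setReclaimDeletesWeight(2.0);     // how strongly to favor reclaiming deletions
        iwc.setMergePolicy(tmp);
    }
}
```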

  27. Hello Mike,

    If I'm not mistaken, the variable maxMergeAtOnce not only controls how many segments are merged at once, but also how many are at least required for a merge to happen.

  28. Hi thomas,

    You're right: if you have fewer than maxMergeAtOnce segments in the index, then no merging will happen.

    It is possible for a natural (not forced aka optimize) merge to merge fewer than maxMergeAtOnce segments, when the maxMergedSegmentBytes is hit. But this is normally rare, because usually we can merge in smaller/tiny segments to get up to maxMergeAtOnce while staying below the maxMergedSegmentBytes limit.

  29. Hi Mike,

    I know its almost 2 years since you posted this, but this is an awesome post. I am a research student trying to figure out how to improve lucene further, and wanted to show demos to my advisors which can clearly communicate how stuff works inside lucene. Your awesome videos there helped me a lot. You saved me a lot of work.

    Thank you.

  30. Hi Mike,
    I'm trying to understand the spare storage needed for a live, steady-state index (meaning add rate = delete rate = 10% of the index per month).
    If I understand well, the original index stays intact until the merge is done, while all the merging segments are processed in memory and the merged segment is copied to the spare storage, and then the original segments are deleted from disk.
    This spare storage is bounded above by a factor of 2 of the actual index storage - representing the worst case of optimizing with no deleted docs.
    But in fact, the actual spare storage I see in the video is much lower, because of the 5 GB limit, because only live docs are copied, and because the probability of a concurrent merge at a small indexing rate as mentioned should be very low.

    Can I conclude, assuming no optimize process, that the spare storage is close to the worst-case single merge, with no relation to the mergeFactor? Which makes it 5 GB of spare only, right?

  31. Hi kbros,

    Actually, transient disk usage is a bit more complex, due to commits. A commit point will hold references to the segments that existed as of that commit, which means a merge that completes cannot delete those segments referenced by any commit points, until a new commit is done and the old commits are dropped.

    The worst case situation is you open an index, call forceMerge(1) (what optimize was renamed to), and you enable compound-file-format. In this case the peak disk usage will be 3X the original index size, and it occurs after the merge has finished and the compound file has been built. Once you commit, the index is back to 1X (well, less, since merged segments take less storage than the sum of the original segments).

    But, yes, the 5 GB cap will limit the transient space required for a merge. And merging is not done in RAM (certain things are in RAM, but the big things stay on disk).

    Replies
    1. For the last paragraph - "certain things stay on RAM" - what would be the peak memory usage for this max merge of 5GB?
      I assume the merge is done part by part between the inverted and uninverted files, as the uninverted files just need to be appended to a new file, requiring only a few MB of buffer.
      Let's say my inverted files are 2 GB out of these "about to merge" segments; do you have any approximation of what this peak memory usage would be?

    2. The big thing that stays in RAM is a logical int[] mapping old docIDs to new docIDs, but in more recent versions of Lucene (4.x) we use a much more efficient structure than a simple int[] ... see https://issues.apache.org/jira/browse/LUCENE-2357

      How much RAM is required is mostly a function of how many documents (lots of tiny docs use more RAM than fewer huge docs).

    3. Why is MaxMergedSegmentMB set to 5 GB by default and not more?

      Setting it to twice that size would save up to twice the IOs, so what would be the cons of it?

      Thanks

    4. Well, the idea is to avoid the "truly massive" merges that unexpectedly make IW.close take hours ... it's also not clear that there are any real search perf gains once individual segments are so large. You don't really have twice the IOs for searching ... you do have twice the seeks, but, if the segments are large enough, that seek cost is amortized.

  32. How is InPlaceMergeSorter applied during indexing?

    Replies
    1. InPlaceMergeSorter is a low-level API to sort an array in place ... it's used in various places; try grep'ing the sources?

  33. Hi Mike, your video gives real insight into the merging of segments. I have not found any article or blog with this much clarity about merging, thanks. I need your advice on setting up Solr configuration parameters. I am facing issues with memory and the time taken by merging segments while performing a delta import. My present index has around 5.4 million documents and the indexed file size is 37.5 GB. Through delta import, it indexes around 2000-2500 documents and takes more than an hour to merge the segments. There are no deletes or updates during the delta indexing. What are the Solr configurations that I can use to reduce the merging time and memory use? On a 48 GB RAM machine, it consumes around 40 GB during merging.
    --sabeer

    Replies
    1. Hi, please ask that question on the Solr user's list (solr-user@lucene.apache.org).

  34. I am having an issue where when using the TieredMergePolicy (via elasticsearch), one node does 4x as many merges as the other nodes but with .25x the throughput. Looking at the info stream logs it seems that there are many small segments in the index, maybe due to periodic refreshing due to real time requirements?

    Would it make sense to somehow allow more segments to be merged at once when doing a merge of the "bottom" step?
