Changing Bits: Visualizing Lucene's segment merges

Friday, February 11, 2011

Visualizing Lucene's segment merges

If you've ever wondered how Lucene picks segments to merge during indexing, it looks something like this:

That video displays segment merges while indexing the entire Wikipedia (English) export (29 GB plain text), played back at ~8X real-time.

Each segment is a bar, whose height is the size (in MB) of the segment (log-scale). Segments on the left are largest; as new segments are flushed, they appear on the right. Segments being merged are colored the same color and, once the merge finishes, are removed and replaced with the new (larger) segment. You can see the nice logarithmic staircase pattern that merging creates.

By default, using ConcurrentMergeScheduler, Lucene executes each merge in a separate thread, allowing multiple merges to run at once without blocking ongoing indexing. The bigger the merge the longer it takes to finish.

One simple metric you can use to measure overall merge cost is to divide the total number of bytes read/written for all merging by the final byte size of an index; smaller values are better. This is analogous to the write amplification measure that solid-state disks use, in that your app has written X MB but because of merging and deleted documents overhead, Lucene had to internally read and write some multiple of X. You can think of this write amplification as a tax on your indexing; you don't pay this tax up front, when the document is first indexed, but only later as you continue to add documents to the index. The video shows the total size of the index as well as net bytes merged, so it's easy to compute write amplification for the above run: 6.19 (final index size was 10.87 GB and net bytes copied during merging was 67.30 GB).

Proper merge selection is actually a tricky problem, in general, because we must carefully balance not burning CPU/IO (due to inefficient merge choices), while also not allowing too many segments to accumulate in the index, as this slows down search performance. To minimize merge cost, you ideally would merge only equal-sized segments, and merge a larger number of segments at a time.

In fact, from the viewpoint of the MergePolicy, this is really a game against a sneaky opponent who randomly makes sudden changes to the index, such as flushing new segments or applying new deletions. If the opponent is well behaved, it'll add equal sized, large segments, which are easy to merge well, as was the case in the above video; but that's a really easy game, like playing tic-tack-toe against a 3 year old.

This opponent is more like playing a game of chess:

No more nice looking staircase! This test shows the more challenging near-real-time use case, which calls updateDocument (= delete + add) at a high rate and frequently opens a new reader (creating a new segment each time). The dark gray band on top of each segment shows the proportion of deletions in that segment. When you delete a document in Lucene, the bytes consumed by that document are not reclaimed until the segment is merged, and you can see old segments being eroded as new segments are appended to the index. Unfortunately, Lucene's current default LogByteSizeMergePolicy struggles to pick good merges against this opponent, often merging irregularly sized segments.

The big issue with LogByteSizeMergePolicy is that it must pick adjacent segments for merging. However, we recently relaxed this longstanding limitation, and I'm working on a new merge policy, TieredMergePolicy (currently a patch on LUCENE-854) to take advantage of this. TieredMergePolicy also fixes some other limitations of LogByteSizeMergePolicy, such as merge cascading that results in occasionally "inadvertently optimizing" the index as well as the overly coarse control it offers over the maximum segment size.

TieredMergePolicy first computes the allowed "budget" of how many segments should be in the index, by counting how many steps the "perfect logarithmic staircase" would require given total index size, minimum segment size (floored), mergeAtOnce, and a new configuration maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be.

Whenever the index is over-budget, it selects the best merge. Potential merges are scored with a combination of skew (basically how "equally sized" the segments being merged are), total size (smaller merges are favored), and how many deleted documents will be reclaimed. It also tries to merge to the exact maximum segment size (default 5GB).

Here's the same difficult near-real-time test, this time using TieredMergePolicy instead:

Note how the segments are now sorted by size, since TieredMergePolicy is allowed to merge non-adjacent segments. For the above difficult run, the write amplification for Lucene's current default merge policy (LogByteSizeMergePolicy) was 14.49 while the new merge policy (TieredMergePolicy) was 13.64, a nice improvement, though not as much as I was expecting. I suspect this is because TieredMergePolicy works hard to hit the max segment size (5 GB), resulting in 6 maximum sized segments while LogByteSizeMergePolicy had only 3. These numbers are much higher than the 6.19 write amplification from the "easy" merging, since that merging was about as efficient as we can hope for.

While TieredMergePolicy is a good improvement over LogByteSizeMergePolicy, it's still theoretically possible to do even better! In particular, TieredMergePolicy is greedy in its decision making: it only looks statically at the index, as it exists right now, and always chooses what looks like the best merge, not taking into account how this merge will affect future merges nor what further changes the opponent is likely to make to the index. This is good, but it's not guaranteed to produce the optimal merge sequence. For any series of changes made by the opponent there is necessarily a corresponding perfect sequence of merges, that minimizes net merge cost while obeying the budget. If instead the merge policy used a search with some lookahead, such as the Minimax algorithm, it could do a better job setting up more efficient future merges. I suspect this theoretical gain is likely small in practice; but if there are any game theorists out there reading this now, I'd love to be proven wrong!

I generated these movies with this simple Python script. It parses the infoStream output from IndexWriter, renders one frame at a time, saved as a PNG file in the local file system, using the Python Imaging Library, and finally encodes all frames into a video using MEncoder with the X264 codec.

61 comments:

seanahanFebruary 13, 2011 at 11:41 AM
What version of Lucene was used for this? I'm assuming 3.0.*? You selected an excellent use case, as I've spent much time indexing the text of wikipedia.
ReplyDelete
Replies
John WangFebruary 13, 2011 at 4:56 PM
Great post Michael! Do you mind sharing the code that produces the video?
ReplyDelete
Replies
Michael McCandlessFebruary 13, 2011 at 5:30 PM
Hi seanahan,

This was all against Lucene's trunk version (eventually 4.0), but I'd like to backport the new merge policy to 3.2.

Wikipedia is great fun to index! And, the fact I can index the entire Wikipedia (English) export in 13 min 9 secs using Lucene is awesome (http://chbits.blogspot.com/2010/09/lucenes-indexing-is-fast.html).
ReplyDelete
Replies
Michael McCandlessFebruary 13, 2011 at 5:31 PM
Hi John,

Sure -- the code is already public (see last paragraph in my post).
ReplyDelete
Replies
AnonymousFebruary 14, 2011 at 2:25 PM
Hey Mike - this is awesome!
What about the MergePolicy from Zoie?
ReplyDelete
Replies
Michael McCandlessFebruary 14, 2011 at 5:24 PM
Hi Otis,

Yeah I'm very curious about what BalancedSegmentMergePolicy looks like! But I haven't done it yet... looks like maybe John will make the video :)
ReplyDelete
Replies
hao yanFebruary 16, 2011 at 5:01 PM
I would like to do it for zoie!
ReplyDelete
Replies
UnknownFebruary 23, 2011 at 11:16 PM
Great article Mike.
I'd like to run the program against the IndexWriter log output for building one of our 300GB indexes. It'll take a bit longer than 13 minutes :) The write amplification metric should help us quantify what we intuitively guessed about reducing extra I/O by messing around with the ramBufferSize and the merge factor. I'm looking forward as well to trying the TieredMergePolicy on our data as well.

Tom Burton-West www.hathitrust.org/blogs
ReplyDelete
Replies
Michael McCandlessFebruary 25, 2011 at 7:02 AM
Hi Tom,

I'd love to see the video from your index rebuild ;)

Mike
ReplyDelete
Replies
simonwFebruary 28, 2011 at 7:18 AM
13 min 9 secs - crazy shit man!
ReplyDelete
Replies
Michael SokolovApril 4, 2011 at 12:00 PM
I was intrigued by your query about the theoretical optimum performance here. To be a bit more formal, you'd have to defined optimality in a measurable way, but I'm not sure what the right way is. Apparently it is preferable to merge segments of equal size, but I don't understand why. Could you quantify that in some way? The other factor you'd need to quantify would be the impact of having too many segments. Clearly, in the absence of these two constraints, the policy that minimizes I/O (for a finite run) is not to do any merging until the very end, and then merge everything. So do you have any rule of thumb about how bad it is to have too many segments. Does it get bad in a linear way?
ReplyDelete
Replies
Michael McCandlessApril 4, 2011 at 3:10 PM
Hi sokolov,

Great question on optimality... it's tricky.

One way to bound the problem is to assign a budget: given X GB of indexed text, the index is allowed to have Y (= O(log(X))) segments, each segment at most size M.

Then, optimal is that merge policy that fits within this budget as docs are added, and whose net cost (say total number of bytes read + written) is least?

But, that bound might rightly be accused of assuming the structure of the solution.

So, broadening it... you have to consider simultaneous searching and indexing, ie, periodically (perhaps, very frequently, in the near-real-time case) a reader will be reopened on the (growing) index and searches run with that.

The more segments there are, the more costly the searching is; that added cost varies depending on the type of query, but for the simple case (single term) it's likely close to a relatively smallish constant per segment (ie cost to seek to that segment's postings for this term). For large segments that constant cost is dwarfed by the cost to iterate the postings, but for medium and tiny segments the cost can be highish, I suspect.

You'd have to pick a given "run" (docs indexed at a specific times and searches run at specific times), and then add up merging cost and searching cost and minimize that?

On one extreme (no searching only indexing), yes, you should do no merging at all. Or, if it's indexing and then searching (ie they never overlap), then do no merging, except for one giant merge before searching.

At the other extreme (rarely adding/updating a doc and intense searching) you'd presumably want to optimize (merge all segments to 1) after each added document.

I think?
ReplyDelete
Replies
Michael SokolovApril 8, 2011 at 4:29 PM
It took me a while to realize you'd replied: I thought I might get an e-mail from blogger when that happened. Then I thought to check back. In any case, I understand now that your original question assumed the budget as you described above, and something about a way to model the search cost. Something to ponder in an idle moment...
ReplyDelete
Replies
Babak FarhangApril 20, 2011 at 4:08 AM
Late to the party.. Nice article!

One thing that stood out for me in all 3 videos was the relatively large proportion of deleted documents in segments that had breached the 5Gb max-mergeable-segment-size limit. I had not really thought about this, but now seeing your videos, it seems pretty obvious that it should behave so: once a segment reaches the size limit, it never recovers space lost to the deleted [updated] documents since the segment will never again be merged--unless we run a full optimize, of course.

This suggests to me that perhaps a special type of "split" or "spill" merge might be useful. The idea is to somehow "merge" 2 (in our case) >5Gb segments (containing a large proportion of deletes) in such a way that we end up with one 5Gb segment and say a leftover 3Gb segment.

A simple way to implement this might be to temporarily relax the limit and create an intermediate (in our example) 8Gb merged segment and then split that merged segment into 2 final 5 and 3Gb ones.

What do you think?
ReplyDelete
Replies
Babak FarhangApril 20, 2011 at 4:21 AM
> A simple way to implement this might be to temporarily relax the limit and create an intermediate (in our example) 8Gb merged segment and then split that merged segment into 2 final 5 and 3Gb ones.

On second thought a much simpler way is to just merge the 5Gb segment containing lots of deletes with an empty segment.
ReplyDelete
Replies
Michael McCandlessApril 23, 2011 at 7:48 PM
Hi Babak,

Actually, the default Lucene merge policies now pro-rate a segment's size by what percentage of the docs are not deleted. So if that 5 GB segment has 50% deletions it "looks like" a 2.5 GB segment and is then able to merge with another 2.5 GB segment. But you're right, we could also just merge sooner w/ empty segment; expungeDeletes will actually do this (eg if there's only 1 segment with too many deletions). Making a good merge policy is tricky!
ReplyDelete
Replies
AnonymousJuly 19, 2011 at 12:50 PM
Hi Mike,
This is Ravi,We are using Lucene-2.0 in our email archiving product for indexing/searching the documents.
We have been facing a critical problem which affecting the production on customer sites, the problem is while optimization taking place on larger indices of size > 2 GB, the indexer threads getting into blocked state, since index writer opened for optimization purpose is never getting released back for ongoing indexing. Here i am giving you thread dump of blocked indexer threads and optimizer thread.
I appreciate your help and suggestion in this regard.
ReplyDelete
Replies
Michael McCandlessJuly 19, 2011 at 2:23 PM
Hi Ravi,

Can you bring this question to Lucene's user's list? (ie, send an email to java-user@lucene.apache.org).
ReplyDelete
Replies
AnonymousJuly 19, 2011 at 11:46 PM
Definitely Mike, I will be sending out an email with full details and thread dumps which i have.
ReplyDelete
Replies
AnonymousJuly 19, 2011 at 11:58 PM
Sent an email with subject
Indexer Threads Getting Into BLOCKED State While Optimization Taking Place On Large Indexes Of Size > 2GB

The email is having thread dumps, index sizes information etc.
ReplyDelete
Replies
AnonymousApril 6, 2012 at 1:51 AM
Hi Mike,
I am very inpressed with these video, I am still learning Lucene, and I had a very weird problem yesterday that no matter what I did, I always got multifiles not compound file. By default, Lucene use compound file, but I just got this weird problem, even if I used setUseCompoundFile(true) explicitly... And my code is actually adapted from your book Lucene In Action 2ed and the demo code in lucene.apache.org... How can I fix this?
ReplyDelete
Replies
Michael McCandlessApril 6, 2012 at 6:46 AM
Are you using TieredMergePolicy? Most like it's the setNoCFSRatio that you're hitting. Ie, by default, TMP will allow very large segments (> 10% of total index size) to remain in non-compound format.
ReplyDelete
Replies
AnonymousApril 7, 2012 at 4:58 AM
Thank you very much, Mike. I solved it by setting setNoCFSRatio(1.0). I am using LogByteSizeMergePolicy, now I am just beginner in Lucene... Anyway I feel a little bit weird, I indexed the same sample data from Lucene In Action. But when I run the "Ant Indexer" command on the Lucene In Action folder, it works fine...it comes out a compound file... I am using Lucene 3.1.0... My code is almost the same with your code in the book... So how can your code in book come out a compound file, but I get multifile.... Anyway, thanks a lot...
ReplyDelete
Replies
Michael McCandlessApril 7, 2012 at 10:47 AM
I think this is because the CFS ratio feature showed in up 3.0.3 (and 3.1.0), just after what LIA2 uses (it uses 3.0.2)....
ReplyDelete
Replies
David SmileyMay 14, 2012 at 11:15 AM
Mike,
Do you have any advise on tuning TieredMergePolicy's parameters? Or at least refer me to such information? I sorta see what they do but I have trouble figuring out if I should change their defaults and to what amount to accomplish whatever goals. I know what to do with mergeFactor: higher means fast indexing but slower searches, and vice versa. Is there a parameter setting that reduces the total disk requirements during merging/optimize?
ReplyDelete
Replies
Michael McCandlessMay 14, 2012 at 3:07 PM
Hi David,

TMP splits mergeFactor into two params: segmentsPerTier (= how many segments are allowed per "level" in the index) and maxMergeAtOnce/Explicit (= when a merge runs it can merge at most this many segments at a time).

Higher segmentsPerTier mean less merging during indexing but slower searching. maxMergeAtOnce/Explicit is really a function of how much RAM + IO bandwidth the machine doing the indexing has; you just have to test to see where the sweet spot is.

TMP also has reclaimDeletesWeight: higher values will more strongly favor merging segments that have a high %tg of deleted documents.
ReplyDelete
Replies
thomasJune 19, 2012 at 7:52 AM
Hello Mike,

If I'm not mistaken then, the variable maxMergeAtOnce does not only control how many segments are merged at once, but also how many are at least required for a merge to happen.
ReplyDelete
Replies
Michael McCandlessJune 19, 2012 at 8:36 AM
Hi thomas,

You're right: if you have fewer than maxMergeAtOnce segments in the index, then no merging will happen.

It is possible for a natural (not forced aka optimize) merge to merge fewer than maxMergeAtOnce segments, when the maxMergedSegmentBytes is hit. But this is normally rare, because usually we can merge in smaller/tiny segments to get up to maxMergeAtOnce while staying below the maxMergedSegmentBytes limit.
ReplyDelete
Replies
AnonymousApril 22, 2013 at 12:11 PM
Hi Mike,

I know its almost 2 years since you posted this, but this is an awesome post. I am a research student trying to figure out how to improve lucene further, and wanted to show demos to my advisors which can clearly communicate how stuff works inside lucene. Your awesome videos there helped me a lot. You saved me a lot of work.

Thank you.
ReplyDelete
Replies
Michael McCandlessApril 22, 2013 at 12:24 PM
Hi Anonymous,

You're welcome!
ReplyDelete
Replies
kbrosMay 20, 2013 at 5:00 PM
Hi Mike,
I try to understand about the spare storage needed for a living steady index (meaning add rate=delete rate=10% of the index per month).
If I understand well, the original index stays intact until the merge is done, while all the merging-segments are being processed on memory and the merged segment is copied on the spare storage, and then the original segments are deleted from disk.
This space storage is up-limited by factor 2 of the actual index storage - representing the worst case of optimizing with no-deleted docs.
But in fact, the actual spare storage I do see in the video is much lower because of the 5GB limit, only alive docs are copied and the probability of a concurrent merge in a small indexing rate as mentioned should be very low.

Can I conclude, assuming no optimize process, that the spare storage is close to the worst case single merge, no relation to the mergeFactor? Which makes it 5GB spare only, right?
ReplyDelete
Replies
Michael McCandlessMay 20, 2013 at 5:36 PM
Hi kbros,

Actually, transient disk usage is a bit more complex, due to commits. A commit point will hold references to the segments that existed as of that commit, which means a merge that completes cannot delete those segments referenced by any commit points, until a new commit is done and the old commits are dropped.

The worst case situation is you open an index, call forceMerge(1) (what optimize was renamed to), and you enable compound-file-format. In this case the peak disk usage will be 3X of the original index size, and it occurs after the merge finished and the compound file was built. Once you commit then the index is back to 1X (well, less, since merged segments take less storage the sum of the original segments).

But, yes, the 5 GB cap will limit the transient space required for a merge. And merging is not done in RAM (certain things are in RAM, but the big things stay on disk).
ReplyDelete
Replies
MrugendraMarch 18, 2014 at 12:01 PM
How InPlaceMergeSorter is applied during indexing?
ReplyDelete
Replies
AnonymousMay 18, 2014 at 6:23 AM
Hi Mike, your video has covered an insight about the merging of segments. I did not find any article or blog that has given this much clarity about merging, thanks. I need your advise on setting up solr configuration parameters. I am facing issues with memory and time taken by merging the segments while performing delta import. My present index has around 5.4 million documents and indexed file size is 37.5 GB. Through delta import, it indexes around 2000-2500 documents and taking more than an hour to merge the segments. There is no delete or update during the delta indexing. What are the solr configurations that I can use to reduce the merging time and memory use. On 48 GB RAM machine, it consumes around 40 GB during merging
--sabeer
ReplyDelete
Replies
AnonymousJuly 6, 2014 at 12:01 PM
I am having an issue where when using the TieredMergePolicy (via elasticsearch), one node does 4x as many merges as the other nodes but with .25x the throughput. Looking at the info stream logs it seems that there are many small segments in the index, maybe due to periodic refreshing due to real time requirements?

Would it make sense to somehow allow more segments to be merged at once when doing a merge of the "bottom" step?
ReplyDelete
Replies
AnonymousOctober 13, 2014 at 12:25 AM
Hi Mike,
Great article. Thank you for sharing all the info. here including how to make / capture the frames while merge processing is going on. I see that this article is 3 years old. Would you please comment on currentness of this article w.r.to latest version Lucene. How is the impacting Solr 4.10 Thank you!. -hg
ReplyDelete
Replies
unity is strenghtFebruary 20, 2015 at 10:44 PM
Hi Mike,

I am an Appian BPM Administrator and from past few months in to situation where Prod outage is reported frequently due Jboss(EAP 6.3) going down which is the our application server.. happens due to JVM crash. Now crash logs have same story all the time where current thread is LUCENE merge thread and all the elasticsearch thread are blocked ...which is native code as far as I know and explained by Java support. I really dont know anything about indexing/lucene/elasticsearch where is Lucene coming into picture is we use IIS as a web server ...Any guidence will be helpful as it is Prod we also have jconsole connected but not sure which thread to minter for lucene....thanks R
ReplyDelete
Replies
AnonymousMarch 6, 2015 at 1:41 PM
HI Mike,

Can you share your script. The link doesn;t work
ReplyDelete
Replies
AnonymousSeptember 2, 2016 at 10:48 AM
Hi Mike,

Can you please explain the pros and cons of using useCompoundFile setting?

Thank you
ReplyDelete
Replies
WilliamKingSeptember 20, 2016 at 6:28 AM
nice
ReplyDelete
Replies
UnknownMay 11, 2017 at 1:34 AM
Hi Michael ,
Read you article its awesome . I am indexing data close to 45 Million . Documents on solr 6.4.2 . Primarily it consists e of close to 80 odd fields of which some are large text fields and some are very large fields(articles or reports). All three cores are on single solr instance . I am facing issues with commit of data on one core (the core which is master core has 50% of documents) . It takes some hours to commit data and rest are working fine . all Have same config . cofig parameters are all default with atocmommit, softcommit,tlog disabled. Please It will be really helpful if you could point me to a direction to look .
ReplyDelete
Replies
ravikumar ithaMay 16, 2017 at 5:29 PM
Michael - this is awesome blog. Six years on but still fascinating to read. +1 for your focus to the most detail e.g. "updateDocument (= delete + add)".

Can you please explain how this analogy can be retrofitted to how updates are handled in the recent versions i.e. Atomic Updates, In-place updates, Optimistic Concurrency type of updates.

This page: https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents#UpdatingPartsofDocuments-OptimisticConcurrency

does not talk about the internals. Is in-place update just for user convenience and is still delete and insert? Thanks

ReplyDelete
Replies
UnknownMay 18, 2017 at 7:48 AM
Hi michael,
I just like to know when the each segment is creating in Lucene, once the documents are flushed from tlog files to I/O disk or one each commit it create new small size segnment or any size based segement creation on reach of certain size.
ReplyDelete
Replies

Add comment