Tuesday, June 7, 2011

Lucene's near-real-time search is fast!

Lucene's near-real-time (NRT) search feature, available since 2.9, enables an application to make index changes visible to a new searcher with fast turnaround time. In some cases, such as modern social/news sites (e.g., LinkedIn, Twitter, Facebook, Stack Overflow, Hacker News, DZone, etc.), fast turnaround time is a hard requirement.

Fortunately, it's trivial to use. Just open your initial NRT reader, like this:

// w is your IndexWriter
IndexReader r = IndexReader.open(w, true);

(That's the 3.1+ API; prior to that use w.getReader() instead).

The returned reader behaves just like one opened with IndexReader.open: it exposes the point-in-time snapshot of the index as of when it was opened. Wrap it in an IndexSearcher and search away!

Once you've made changes to the index, call r.reopen() and you'll get another NRT reader; just be sure to close the old one.

What's special about the NRT reader is that it searches uncommitted changes from IndexWriter, enabling your application to decouple fast turnaround time from index durability on crash (i.e., how often commit is called), something not previously possible.

Under the hood, when an NRT reader is opened, Lucene flushes indexed documents as a new segment, applies any buffered deletions to in-memory bit-sets, and then opens a new reader showing the changes. The reopen time is in proportion to how many changes you made since last reopening that reader.

Lucene's approach is a nice compromise between immediate consistency, where changes are visible after each index change, and eventual consistency, where changes are visible "later" but you don't usually know exactly when.

With NRT, your application has controlled consistency: you decide exactly when changes must become visible.

Recently there have been some good improvements related to NRT:
  • New default merge policy, TieredMergePolicy, which is able to select more efficient non-contiguous merges, and favors segments with more deletions.

  • NRTCachingDirectory takes load off the IO system by caching small segments in RAM (LUCENE-3092).

  • When you open an NRT reader you can now optionally specify that deletions do not need to be applied, making reopen faster for those cases that can tolerate temporarily seeing deleted documents returned, or have some other means of filtering them out (LUCENE-2900).

  • Segments that are 100% deleted are now dropped instead of inefficiently merged (LUCENE-2010).

How fast is NRT search?

I created a simple performance test to answer this. I first built a starting index by indexing all of Wikipedia's content (25 GB plain text), broken into 1 KB sized documents.

Using this index, the test then reindexes all the documents again, this time at a fixed rate of 1 MB/second plain text. This is a very fast rate compared to the typical NRT application; for example, it's almost twice as fast as Twitter's recent peak during this year's superbowl (4,064 tweets/second), assuming every tweet is 140 bytes, and assuming Twitter indexed all tweets on a single shard.

The test uses updateDocument, replacing documents by randomly selected ID, so that Lucene is forced to apply deletes across all segments. In addition, 8 search threads run a fixed TermQuery at the same time.

Finally, the NRT reader is reopened once per second.

I ran the test on modern hardware, a 24 core machine (dual x5680 Xeon CPUs) with an OCZ Vertex 3 240 GB SSD, using Oracle's 64 bit Java 1.6.0_21 and Linux Fedora 13. I gave Java a 2 GB max heap, and used MMapDirectory.

The test ran for 6 hours 25 minutes, since that's how long it takes to re-index all of Wikipedia at a limited rate of 1 MB/sec; here's the resulting QPS and NRT reopen delay (milliseconds) over that time:



The search QPS is green and the time to reopen each reader (NRT reopen delay in milliseconds) is blue; the graph is an interactive Dygraph, so if you click through above, you can then zoom in to any interesting region by clicking and dragging. You can also apply smoothing by entering the size of the window into the text box in the bottom left part of the graph.

Search QPS dropped substantially with time. While annoying, this is expected, because of how deletions work in Lucene: documents are merely marked as deleted and thus are still visited but then filtered out, during searching. They are only truly deleted when the segments are merged. TermQuery is a worst-case query; harder queries, such as BooleanQuery, should see less slowdown from deleted, but not reclaimed, documents.

Since the starting index had no deletions, and then picked up deletions over time, the QPS dropped. It looks like TieredMergePolicy should perhaps be even more aggressive in targeting segments with deletions; however, finally around 5:40 a very large merge (reclaiming many deletions) was kicked off. Once it finished the QPS recovered somewhat.

Note that a real NRT application with deletions would see a more stable QPS since the index in "steady state" would always have some number of deletions in it; starting from a fresh index with no deletions is not typical.

Reopen delay during merging

The reopen delay is mostly around 55-60 milliseconds (mean is 57.0), which is very fast (i.e., only 5.7% "duty cycle" of the every 1.0 second reopen rate). There are random single spikes, which is caused by Java running a full GC cycle. However, large merges can slow down the reopen delay (once around 1:14, again at 3:34, and then the very large merge starting at 5:40). Many small merges (up to a few 100s of MB) were done but don't seem to impact reopen delay. Large merges have been a challenge in Lucene for some time, also causing trouble for ongoing searching.

I'm not yet sure why large merges so adversely impact reopen time; there are several possibilities. It could be simple IO contention: a merge keeps the IO system very busy reading and writing many bytes, thus interfering with any IO required during reopen. However, if that were the case, NRTCachingDirectory (used by the test) should have prevented it, but didn't. It's also possible that the OS is [poorly] choosing to evict important process pages, such as the terms index, in favor of IO caching, causing the term lookups required when applying deletes to hit page faults; however, this also shouldn't be happening in my test since I've set Linux's swappiness to 0.

Yet another possibility is Linux's write cache becomes temporarily too full, thus stalling all IO in the process until it clears; in this case perhaps tuning some of Linux's pdflush tunables could help, although I'd much rather find a Lucene-only solution so this problem can be fixed without users having to tweak such advanced OS tunables, even swappiness.

Fortunately, we have an active Google Summer of Code student, Varun Thacker, working on enabling Directory implementations to pass appropriate flags to the OS when opening files for merging (LUCENE-2793 and LUCENE-2795). From past testing I know that passing O_DIRECT can prevent merges from evicting hot pages, so it's possible this will fix our slow reopen time as well since it bypasses the write cache.

Finally, it's always possible other OSs do a better job managing the buffer cache, and wouldn't see such reopen delays during large merges.

This issue is still a mystery, as there are many possibilities, but we'll eventually get to the bottom of it. It could be we should simply add our own IO throttling, so we can control net MB/sec read and written by merging activity. This would make a nice addition to Lucene!

Except for the slowdown during merging, the performance of NRT is impressive. Most applications will have a required indexing rate far below 1 MB/sec per shard, and for most applications reopening once per second is fast enough.

While there are exciting ideas to bring true real-time search to Lucene, by directly searching IndexWriter's RAM buffer as Michael Busch has implemented at Twitter with some cool custom extensions to Lucene, I doubt even the most demanding social apps actually truly need better performance than we see today with NRT.

NIOFSDirectory vs MMapDirectory

Out of curiosity, I ran the exact same test as above, but this time with NIOFSDirectory instead of MMapDirectory:



There are some interesting differences. The search QPS is substantially slower -- starting at 107 QPS vs 151, though part of this could easily be from getting different compilation out of hotspot. For some reason TermQuery, in particular, has high variance from one JVM instance to another.

The mean reopen time is slower: 67.7 milliseconds vs 57.0, and the reopen time seems more affected by the number of segments in the index (this is the saw-tooth pattern in the graph, matching when minor merges occur). The takeaway message seems clear: on Linux, use MMapDirectory not NIOFSDirectory!

Optimizing your NRT turnaround time

My test was just one datapoint, at a fixed fast reopen period (once per second) and at a high indexing rate (1 MB/sec plain text). You should test specifically for your use-case what reopen rate works best. Generally, the more frequently you reopen the faster the turnaround time will be, since fewer changes need to be applied; however, frequent reopening will reduce the maximum indexing rate.

Most apps have relatively low required indexing rates compared to what Lucene can handle and can thus pick a reopen rate to suit the application's turnaround time requirements.

There are also some simple steps you can take to reduce the turnaround time:
  • Store the index on a fast IO system, ideally a modern SSD.

  • Install a merged segment warmer (see IndexWriter.setMergedSegmentWarmer). This warmer is invoked by IndexWriter to warm up a newly merged segment without blocking the reopen of a new NRT reader. If your application uses Lucene's FieldCache or has its own caches, this is important as otherwise that warming cost will be spent on the first query to hit the new reader.

  • Use only as many indexing threads as needed to achieve your required indexing rate; often 1 thread suffices. The fewer threads used for indexing, the faster the flushing, and the less merging (on trunk).

  • If you are using Lucene's trunk, and your changes include deleting or updating prior documents, then use the Pulsing codec for your id field since this gives faster lookup performance which will make your reopen faster.

  • Use the new NRTCachingDirectory, which buffers small segments in RAM to take load off the IO system (LUCENE-3092).

  • Pass false for applyDeletes when opening an NRT reader, if your application can tolerate seeing deleted doccs from the returned reader.

  • While it's not clear that thread priorities actually work correctly (see this Google Tech Talk), you should still set your thread priorities properly: the thread reopening your readers should be highest; next should be your indexing threads; and finally lowest should be all searching threads. If the machine becomes saturated, ideally only the search threads should take the hit.

Happy near-real-time searching!

22 comments:

  1. It's worth mentioning that IndexReader.open(IndexWriter) is only available since Lucene v3.1; earlier versions don't have this method (article assumes that NRT was available since 2.9 and can be misleading). Otherwise a very interesting read, thanks Mike!

    ReplyDelete
  2. Ahh, good point mindas; I just fixed it. Thanks!

    ReplyDelete
  3. I'm curious how this compares to using Zoie (http://sna-projects.com/zoie/) from LinkedIn on top of Lucene? I know they did an awful lot of tuning. Not sure why their work hasn't made it into the core of Lucene.

    ReplyDelete
  4. Hi Yeroc,

    I haven't looked closely a Zoie for some time... so the following
    could very well be "dated":

    The biggest difference is that Zoie aims for immediate consistency
    (reopen after every index change & next query), which I think very few
    apps really require, given how fast NRT is.

    Also, NRTCachingDir (caching small segments in RAM) achieves the
    biggest (in my opinion) benefit of Zoie, but with substantially less
    added complexity. Reducing complexity is important because it means
    less risk of bugs; for example, Zoie had some scary corruption bugs,
    which took quite some time to track down; see
    https://issues.apache.org/jira/browse/LUCENE-2729

    The other part of Zoie I remember is deferring resolving deletions to
    Lucene docIDs, and instead using a bloom filter to post-filter
    collected documents. While I understand the motivation for this
    ("immediate consistency") I think it's the wrong tradeoff since it
    necessarily slows down all searching (checking a bloom filter is more
    costly than Lucene's checking a bit set), not to mention the added RAM
    required for the bloom filter.

    Ie, it's better to spend more time during reopen to resolve the
    deletions, so that searches don't slow down.

    I'm sure there are other differences... and I imagine Zoie has changed
    a lot since I last looked!

    ReplyDelete
  5. Hi,

    Just a question. You mention that "when an NRT reader is opened, Lucene flushes indexed documents as a new segment". So how exactly is this different than a commit?

    Thanks for that nice article!

    ReplyDelete
  6. Hi Pascal,

    A commit must also fsync() the new files, so that the bits are all guaranteed to be on stable storage, which can be extremely costly (seconds).

    Also, the commit must write a new segments_N file, and write deleted docs to disk, which NRT avoids (because it gets these changes directly in RAM).

    ReplyDelete
  7. Hi Mike,

    Great post. Please let me know, is there an e-mail address I can contact you in private?

    ReplyDelete
  8. Mike - I've looked at your code and bought LIA. So I hope I earned a question or two :)

    The NRTManager - does it work well in the scenario that you open one IndexWriter, use that mostly for indexing in one thread, but occasionally for deleting documents in another, and use the IndexSearcher provided here for all else? That appears to be yes, and indeed its primary purpose but the code confuses me in one respect:

    The "real questions" - NRTManager never closes any of the old IndexReader or IndexSearcher instances. Isn't this a problem? And didn't we always get told best practice was to reuse the indexsearcher :)

    ReplyDelete
  9. Hi mjb,

    You can of course ask questions without buying the book! ;)

    That use case you described is exactly what NRTManager is good at! Note also that the next Lucene release (3.5.0) will have improvements to SearcherManager to also pull NRT readers from an IndexWriter.

    NRTManager (and SearcherManager) do in fact close the readers, but it's tricky: rather than call .close(), they call .decRef(); under the hood, .close() in fact calls .decRef, but guarded to ensure only the first .close() decRefs. So the readers are getting closed (and we have good stress tests in Lucene that should fail if they were not getting closed...).

    ReplyDelete
  10. Thanks Mike - I guess this is a dumb question but why use SearcherManager vs NRTManager? Well I guess if the node is read only that's one reason, but is there another

    ReplyDelete
  11. Not a dumb question; it's a great one!

    I'm planning to do a blog post explaining the difference... but the gist is this: use NRTManager if you sometimes care to control which indexing changes are visible to which search threads. Otherwise, use SearcherManager.

    For example say you use Lucene to search discussion threads (on a forum). When user X adds a new comment on a discussion thread, you then go and add that comment into the Lucene index. But then when user X searches, rather than just pulling the "current" NRT reader, which may not yet reflect this user's added comment, you should pull a reader that's guaranteed to reflect the change from the comment the user just added. For this (tying certain searches to certain changes in the index), you need to use NRTManager.

    ReplyDelete
  12. Hi Mike,

    I´m just trying to find out how good Lucene is in comparison to other search tools.

    I recently looked at www.deudil.com and was impressed by the autopopulation and search experience but wasn´t sure whether there is a way to find out what they used.

    I´m really keen to select the best solution to deliver the search experience for our website hence would appreciate your guidance. I read a lot about Funnelback as well but not sure about their solution either.

    Thanks

    ReplyDelete
  13. Mike - this might be clearer in the docs.

    Imagine you have this code

    // Previously build the index of 100000 items of which 10000 have subject:apple
    // Previously run a search showing 10000 results
    System.out.println("Delete item");
    NRTSingleton.getInstance().getNRTManager().deleteDocuments(new Term("subject", "apple"));
    one = NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).acquire();
    System.out.println(NRTSingleton.getInstance().getNRTManager().maybeReopen(true)); // This is critical, and the "two" will fail otherwise to see the deletion until 45 seconds
    two = NRTSingleton.getInstance().getNRTManager().getSearcherManager(true).acquire();
    r=LuceneBasicSearch2.searchIndex("apple",one,0, 10000); // 10000
    LuceneBasicSearch2.output("Search after deletes - one", r,true);

    r=LuceneBasicSearch2.searchIndex("apple",two,0, 10000); // 0
    LuceneBasicSearch2.output("Search after deletes - two", r,true);

    NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).release(one);

    Here's my point.Everything mostly worked as expected. A searcher with (false) passed didn't see the deletes until after the sleep (the nrtbackground thread is set to about 30 s). But it surprised and confused me that the searcher (two) ALSO didn't see the deletes until after the sleep, unless you force the issue with maybeReopen(true). I guess I would have thought the getSearcherManager(true) would have automagically forced a separate IndexSearcher in this case?

    Do you follow me?
    one=null;
    //NRTSingleton.getInstance().getNRTManager().maybeReopen(true);
    System.out.println("Sleep");
    Thread.sleep(45000L);
    System.out.println("After sleep");
    one = NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).acquire();
    r=LuceneBasicSearch2.searchIndex("apple",one,0, 10000);
    LuceneBasicSearch2.output("Search after deletes and sleep - new one", r,true); // 0 regardless of maybeReopen here since background thread goes every 30 seconds

    ReplyDelete
    Replies
    1. Code got mangled:

      Here it is

      java.io.File path=new File("c:\\index");
      LuceneBuildIndex.createNewIndex(path, 100000);
      IndexWriter iwriter=null;
      IndexSearcher one=null; IndexSearcher two=null;
      IndexWriterConfig iwc =new IndexWriterConfig(LuceneBasicSearch2.version,new StandardAnalyzer(LuceneBasicSearch2.version));
      Directory directory = new SimpleFSDirectory(path);
      try {
      iwriter=new IndexWriter(directory,iwc);
      NRTSingleton.getInstance().setNRTManager(iwriter);
      one =NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).acquire();
      ResultList r=LuceneBasicSearch2.searchIndex("apple",one,0, 10000);
      LuceneBasicSearch2.output("Search before deletes", r,true); // 10000
      NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).release(one);
      one=null;
      System.out.println("Delete item");
      NRTSingleton.getInstance().getNRTManager().deleteDocuments(new Term("subject", "apple"));
      one = NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).acquire();
      System.out.println(NRTSingleton.getInstance().getNRTManager().maybeReopen(true)); // This is critical
      two = NRTSingleton.getInstance().getNRTManager().getSearcherManager(true).acquire();
      r=LuceneBasicSearch2.searchIndex("apple",one,0, 10000); // 10000
      LuceneBasicSearch2.output("Search after deletes - one", r,true);

      r=LuceneBasicSearch2.searchIndex("apple",two,0, 10000); // 0
      LuceneBasicSearch2.output("Search after deletes - two", r,true);

      NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).release(one);
      one=null;
      //NRTSingleton.getInstance().getNRTManager().maybeReopen(true);
      System.out.println("Sleep");
      Thread.sleep(45000L);
      System.out.println("After sleep");
      one = NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).acquire();
      r=LuceneBasicSearch2.searchIndex("apple",one,0, 10000);
      LuceneBasicSearch2.output("Search after deletes and sleep - new one", r,true); // 0 regardless of maybeReopen here since background thread goes every 30 seconds
      NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).release(one);
      one=null;
      } finally {
      if (one !=null) NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).release(one);
      if (two !=null) NRTSingleton.getInstance().getNRTManager().getSearcherManager(true).release(two);
      if (iwriter !=null) iwriter.close();
      NRTSingleton.getInstance().close();
      }

      Delete
  14. Hi MJB,

    The boolean you pass to getSearcherManager does not force a reopen. So if you pass true, all it does is get you (immediately) the latest searcher that was reopened with applyAllDeletes=true. Ie, that returned searcher is only guaranteed to have applied all deletes as of when the last maybeReopen(true) was called.

    So you'll still need to use the .waitForGeneration APIs if you need to know a certain operation (the delete op in your example) is visible...

    It's also best to use something like the NRTManagerReopenThread: this class watches for any threads waiting on a specific generation, and reopens more aggressively if there are any.

    Mike

    ReplyDelete
  15. okay..... (and BTW I WAS using the reopen thread, that's why Thread.sleep(45000) showed the changes, even with maybeReopen commmented out)....then why even have ANY boolean argument applyDeletes except for maybeReopen?Am I missing something here? Because it seems like you are saying "If you absolutely must have a searcher that reflects the deletes,either the NRTThread must have run (and you can use waitForGeneration), or apply maybeReOpen(true). And then both getSearcherManager(true) and getSearchManager(false) will BOTH reflect the deletes". Do you follow me that this is confusing, or am I being obtuse?

    ReplyDelete
  16. It is indeed tricky, but I think that boolean to getSearcherManager (in addition to maybeReopen) is still useful.

    Imagine an app where sometimes you need deletes applied (users doing searches) and other times you don't (say, testing whether a certain document is in the index, such that you don't care if the old deleted "ghosts" are also returned).

    For that 2nd case you should pass "false" for applyDeletes to waitForGen, because this tells the reopen thread that the caller who is waiting doesn't need deletes applied, which in turn makes the reopen faster.

    ReplyDelete
  17. Anyone have an example of IndexWriter.SetMergedSegmentWarmer usage?

    ReplyDelete
  18. Hello Mike,

    I have got a problem and I would like to ask you for advise. Every 10 minutes I am updating approx 6k documents with lucene 3.6, there might be some new documents added but most of the time we are just updating and need to see only the latest. I am using NRT and commit index writer after 1000 docs. After approx. 100 iteration I am seeing the problem with constant merging of files (it seems that it constantly merging files with one document?) Can you please advise what policy I should use? From the logs


    Also I have noticed such messages in log:
    DW: ramUsed=889.721 MB newFlushedSize=0.1 MB docs/MB=9.982 new/old=0.011%

    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: don't apply deletes now delTermCount=42 bytesUsed=43008
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: clearFlushPending
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: findMerges: 15 segments
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=*_1d9wh(3.5):C4916/288 size=52.694 MB
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da6s(3.5):C159 size=2.079 MB
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da5p(3.5):C153 size=2.025 MB [merging]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da73(3.5):C10 size=0.223 MB [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7e(3.5):C1 size=0.100 MB [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da78(3.5):C1 size=0.100 MB [merging] [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da76(3.5):C1 size=0.100 MB [merging] [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7g(3.5):C1 size=0.100 MB [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da77(3.5):C1 size=0.100 MB [merging] [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7b(3.5):C1 size=0.100 MB [merging] [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da74(3.5):C1 size=0.100 MB [merging] [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7f(3.5):C1 size=0.071 MB [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7a(3.5):C1 size=0.068 MB [merging] [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7c(3.5):C1 size=0.068 MB [merging] [floored]
    IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da79(3.5):C1 size=0.068 MB [merging] [floored]

    Thanks for any response

    ReplyDelete
  19. Hi Xatrix,

    Could you send an email to the Lucene users list instead (java-user@lucene.apache.org)? Thanks.

    ReplyDelete