SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.
But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must first commit those changes.
If you have access to the IndexWriter that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice "best of both worlds" blend between the traditional immediate and eventual consistency models.
Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides extremely fast turnaround time on making index changes visible to searchers. Frequent reopens, such as every 50 milliseconds, even under relatively high indexing rates, are easily achievable on modern hardware.
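To make the "separate decisions" idea concrete, here is a minimal toy model. It does not use Lucene at all: ToyIndex and its methods are hypothetical names invented for this sketch, and it only tracks counts to show that visibility (reopen) and durability (commit) advance independently.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical, not a Lucene class): visibility and durability
// advance independently of each other.
class ToyIndex {
    private final List<String> changes = new ArrayList<>(); // every change made so far
    private int visibleCount = 0; // changes searchers can see (as of the last reopen)
    private int durableCount = 0; // changes that would survive a crash (as of the last commit)

    void add(String doc) { changes.add(doc); }

    // Cheap, so it can run often: exposes all changes so far to new searches.
    void reopen() { visibleCount = changes.size(); }

    // Costly, so it runs rarely: makes all changes so far crash-safe.
    void commit() { durableCount = changes.size(); }

    int visibleCount() { return visibleCount; }
    int durableCount() { return durableCount; }
}

class ToyIndexDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("post 1");
        index.add("post 2");
        index.reopen(); // searchable now, but not yet durable
        System.out.println(index.visibleCount() + " visible, " + index.durableCount() + " durable");
        index.commit(); // now also durable
        System.out.println(index.visibleCount() + " visible, " + index.durableCount() + " durable");
    }
}
```

In a real application the two calls would typically run on their own schedules, for example a reopen every few tens of milliseconds and a commit every few minutes.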
Fortunately, it's trivial to use SearcherManager with NRT readers. Use the constructor that takes an IndexWriter:
    boolean applyAllDeletes = true;
    ExecutorService es = null;
    SearcherManager mgr = new SearcherManager(writer, applyAllDeletes, new MySearchWarmer(), es);
This tells SearcherManager that its source for new IndexReaders is the provided IndexWriter instance (instead of a Directory instance). After that, use the SearcherManager just as before.
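Using the manager "just as before" means pairing every acquire with a release in a finally block. The sketch below shows why that pairing matters, using a simplified reference-counting model. ToySearcher and ToySearcherManager are hypothetical stand-ins written for this example; they are not Lucene's classes and only approximate what a real manager has to do.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a searcher; NOT Lucene's IndexSearcher.
class ToySearcher {
    final int version;
    private final AtomicInteger refCount = new AtomicInteger(1); // 1 = the manager's own reference
    volatile boolean closed = false;

    ToySearcher(int version) { this.version = version; }

    // Returns false if the searcher was already closed, so the caller can retry.
    boolean tryIncRef() {
        while (true) {
            int count = refCount.get();
            if (count <= 0) return false;
            if (refCount.compareAndSet(count, count + 1)) return true;
        }
    }

    // The last reference to go away closes the searcher.
    void decRef() {
        if (refCount.decrementAndGet() == 0) closed = true;
    }
}

class ToySearcherManager {
    private volatile ToySearcher current = new ToySearcher(0);

    ToySearcher acquire() {
        while (true) {
            ToySearcher s = current;
            if (s.tryIncRef()) return s; // retry if a concurrent reopen closed it first
        }
    }

    void release(ToySearcher s) { s.decRef(); }

    // Swap in a fresh searcher; the old one closes once its last user releases it.
    synchronized void reopen() {
        ToySearcher old = current;
        current = new ToySearcher(old.version + 1);
        old.decRef(); // drop the manager's own reference
    }
}

class AcquireReleaseDemo {
    public static void main(String[] args) {
        ToySearcherManager mgr = new ToySearcherManager();
        ToySearcher s = mgr.acquire();
        try {
            mgr.reopen(); // a reopen in the middle of our search does not close s
            System.out.println("closed during search? " + s.closed);
        } finally {
            mgr.release(s); // always release, even if the search throws
        }
        System.out.println("closed after release? " + s.closed);
    }
}
```

The demo prints false and then true: the old searcher stays usable until the last search holding it lets go, which is the bookkeeping the real class takes off your hands.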
Typically you'll set applyAllDeletes to true, meaning each reopened reader is required to apply all previous deletion operations (deleteDocuments or updateDocument/s) up until that point.
Sometimes your usage won't require deletions to be applied. For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions. If that's the case, you can pass applyAllDeletes=false instead. This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly. However, if you're using Lucene's trunk (to be eventually released as 4.0), another option is to use
id field to greatly reduce the primary-key lookup time.
Note that some or even all of the previous deletes may still be applied even if you pass false. Also, the pending deletes are never lost if you pass false: they remain buffered and will still eventually be applied.
If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two SearcherManagers, one applying deletes and one not.
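For the multi-version scenario mentioned above, "some way to ignore the old versions" could be as simple as a post-filter over the hits. The following is only one possible sketch, assuming each hit carries an id and a numeric version; VersionedHit and LatestVersionFilter are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hit type: just the document's id and its version number.
class VersionedHit {
    final String id;
    final long version;

    VersionedHit(String id, long version) {
        this.id = id;
        this.version = version;
    }
}

class LatestVersionFilter {
    // Keeps only the highest version seen for each id, preserving the order
    // in which each id first appeared in the hit list.
    static List<VersionedHit> keepLatest(List<VersionedHit> hits) {
        Map<String, VersionedHit> latest = new LinkedHashMap<>();
        for (VersionedHit hit : hits) {
            VersionedHit seen = latest.get(hit.id);
            if (seen == null || hit.version > seen.version) {
                latest.put(hit.id, hit);
            }
        }
        return new ArrayList<>(latest.values());
    }
}

class LatestVersionDemo {
    public static void main(String[] args) {
        // The old version of post-1 is still in the index because its delete is unapplied.
        List<VersionedHit> hits = Arrays.asList(
            new VersionedHit("post-1", 1),
            new VersionedHit("post-2", 1),
            new VersionedHit("post-1", 2));
        for (VersionedHit hit : LatestVersionFilter.keepLatest(hits)) {
            System.out.println(hit.id + " v" + hit.version);
        }
    }
}
```

The demo prints post-1 v2 followed by post-2 v1. Whether a filter like this is cheaper than applying deletes depends on how many stale versions your searches tend to hit.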
If you pass a non-null ExecutorService, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request. Most applications do not require this, because the concurrency across multiple searches is sufficient. It's also not clear that this is effective in general, as it adds per-segment overhead and the available concurrency is a function of your index structure. Perversely, a fully optimized index will have no concurrency! Most applications should pass null.
What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches? Use NRTManager.
NRTManager holds onto the IndexWriter instance you provide and then exposes the same APIs for making index changes (addDocument/s, updateDocument/s, deleteDocuments). These methods forward to the underlying IndexWriter, but then return a generation token (a Java long) which you can hold onto after making any given change. The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.
Then, when a given search request requires certain changes to be visible, pass that generation back to NRTManager to obtain a searcher that's guaranteed to reflect all changes for that generation.
Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum. Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using NRTManager.addDocument, saving the returned generation. If she adds multiple posts, just keep the last generation.
Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made. Of course, if your reopen time is fast enough (say once per second), unless Alice types very quickly, any search she runs will already reflect her posts.
But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call NRTManager.waitForGeneration to obtain the SearcherManager to use for searching. If the latest searcher already covers the requested generation, the method returns immediately. Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.
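One rough way to picture the generation bookkeeping is the sketch below. GenerationTracker and its methods are hypothetical names made up for this example, not NRTManager's implementation. It also assumes some other thread calls reopen, whereas the real class requests the reopen for you, and it hands back a generation number rather than a searcher.

```java
// Hypothetical sketch of generation tracking; NOT NRTManager's actual code.
class GenerationTracker {
    private long indexingGen = 0;  // generation of the most recent index change
    private long searchingGen = 0; // newest generation visible to searchers

    // Called for every index change; returns the token the caller holds onto.
    synchronized long recordChange() {
        return ++indexingGen;
    }

    // Called on each reopen: everything indexed so far becomes visible.
    synchronized void reopen() {
        searchingGen = indexingGen;
        notifyAll(); // wake any searches waiting on a generation
    }

    // Blocks until targetGen is visible; returns at once if it already is.
    synchronized long waitForGeneration(long targetGen) {
        try {
            while (searchingGen < targetGen) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for generation", e);
        }
        return searchingGen;
    }

    synchronized long currentSearchingGen() {
        return searchingGen;
    }
}

class GenerationDemo {
    public static void main(String[] args) throws InterruptedException {
        final GenerationTracker tracker = new GenerationTracker();
        tracker.recordChange();                 // Alice's first post
        long aliceGen = tracker.recordChange(); // her second post; keep only the last token

        // Bob has no changes of his own, so he uses whatever is visible now.
        System.out.println("Bob searches at generation " + tracker.currentSearchingGen());

        // A background reopen makes Alice's changes visible.
        Thread reopener = new Thread(tracker::reopen);
        reopener.start();

        // Alice blocks until her generation is searchable.
        long visible = tracker.waitForGeneration(aliceGen);
        System.out.println("Alice searches at generation " + visible);
        reopener.join();
    }
}
```

Because the generation only increases, holding onto the last token is enough: once generation 2 is visible, generation 1 is too.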
If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it matters far less how soon Alice's changes become visible to Bob. There's (usually!) no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.
Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index. That immediate search must first wait for the returned generation to become available.
The power of NRTManager is that you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model. NRTManager hides all the tricky details of tracking generations.
But: don't abuse this! You may be tempted to always wait for the last generation you indexed for all searches, but this would result in very low search throughput on concurrent hardware, since all searches would bunch up waiting for reopens. With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.
Managing reopens is a little trickier with NRTManager, since you should reopen at higher frequency whenever a search is waiting for a specific generation. To address this, there's the useful NRTManagerReopenThread class; use it like this:
    double minStaleSec = 0.025;
    double maxStaleSec = 5.0;
    NRTManagerReopenThread thread = new NRTManagerReopenThread(nrtManager, maxStaleSec, minStaleSec);
    thread.start();
    ...
    thread.close();
The minStaleSec parameter sets an upper bound on the time a user must wait before the search can run. This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec. The maxStaleSec parameter sets a lower bound on how frequently reopens should occur. This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches. Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time. Be sure to eventually call thread.close() when you are done reopening (for example, on shutting down the application).
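The interplay between the two bounds can be sketched as a small scheduling rule. ReopenPolicy below is a hypothetical helper written for illustration, not the source of NRTManagerReopenThread; it simply assumes the thread picks the shorter target whenever a search is waiting on a generation.

```java
// Hypothetical helper showing the policy described above; this is not the
// code of NRTManagerReopenThread.
class ReopenPolicy {
    private final double maxStaleSec; // target for ordinary, periodic reopens
    private final double minStaleSec; // target when a search is waiting on a generation

    ReopenPolicy(double maxStaleSec, double minStaleSec) {
        if (maxStaleSec < minStaleSec) {
            throw new IllegalArgumentException("maxStaleSec must be >= minStaleSec");
        }
        this.maxStaleSec = maxStaleSec;
        this.minStaleSec = minStaleSec;
    }

    // Seconds to sleep before the next reopen, given how long ago the last
    // reopen happened and whether any search is blocked on a generation.
    double secondsUntilNextReopen(double secSinceLastReopen, boolean searchIsWaiting) {
        double target = searchIsWaiting ? minStaleSec : maxStaleSec;
        return Math.max(0.0, target - secSinceLastReopen);
    }
}
```

With the values above, new ReopenPolicy(5.0, 0.025) asks for a 5 second sleep right after a reopen when nobody is waiting (Bob), but only 25 msec when a search is blocked (Alice), and no sleep at all if the last reopen was already longer ago than the applicable target.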
You are also free to use your own strategy for calling maybeReopen; you don't have to use NRTManagerReopenThread. Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!