Changing Bits: Near-real-time readers with Lucene's SearcherManager and NRTManager

Thursday, November 3, 2011

Near-real-time readers with Lucene's SearcherManager and NRTManager

Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.

But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.

If you have access to the IndexWriter that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice "best of both worlds" blend between the traditional immediate and eventual consistency models.

Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides extremely fast turnaround time on making index changes visible to searchers. Frequent reopens such as every 50 milliseconds, even under relatively high indexing rates, is easily achievable on modern hardware.

Fortunately, it's trivial to use SearcherManager with NRT readers: use the constructor that takes IndexWriter instead of Directory:

  boolean applyAllDeletes = true;
  ExecutorService es = null;
  SearcherManager mgr = new SearcherManager(writer, applyAllDeletes,
                                            new MySearchWarmer(), es);

This tells SearcherManager that its source for new IndexReaders is the provided IndexWriter instance (instead of a Directory instance). After that, use the SearcherManager just as before.

Typically you'll set the applyAllDeletes boolean to true, meaning each reopened reader is required to apply all previous deletion operations (deleteDocuments or updateDocument/s) up until that point.

Sometimes your usage won't require deletions to be applied. For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions. If that's the case, you can pass applyAllDeletes=false instead. This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly. However, if you're using Lucene's trunk (to be eventually released as 4.0), another option is to use MemoryCodec on your id field to greatly reduce the primary-key lookup time.

Note that some or even all of the previous deletes may still be applied even if you pass false. Also, the pending deletes are never lost if you pass false: they remain buffered and will still eventually be applied.

If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two SearcherManagers, one applying deletes and one not.

If you pass a non-null ExecutorService, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request. Most applications do not require this, because the concurrency across multiple searches is sufficient. It's also not clear that this is effective in general as it adds per-segment overhead, and the available concurrency is a function of your index structure. Perversely, a fully optimized index will have no concurrency! Most applications should pass null.

NRTManager

What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches? Use NRTManager!

NRTManager holds onto the IndexWriter instance you provide and then exposes the same APIs for making index changes (addDocument/s, updateDocument/s, deleteDocuments). These methods forward to the underlying IndexWriter, but then return a generation token (a Java long) which you can hold onto after making any given change. The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.

Then, when a given search request requires certain changes to be visible, pass that generation back to NRTManager to obtain a searcher that's guaranteed to reflect all changes for that generation.

Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum. Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using NRTManager.addDocument, saving the returned generation. If she adds multiple posts, just keep the last generation.

Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made. Of course, if your reopen time is fast enough (say once per second), unless Alice types very quickly, any search she runs will already reflect her posts.

But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call NRTManager.waitForGeneration to obtain the SearcherManager to use for searching. If the latest searcher already covers the requested generation, the method returns immediately. Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.

If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it's far less important when Alice's changes become immediately visible to Bob. There's (usually!) no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.

Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index. That immediate search must first wait for the returned generation to become available.

The power of NRTManager is you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model. NRTManager hides all the tricky details of tracking generations.

But: don't abuse this! You may be tempted to always wait for last generation you indexed for all searches, but this would result in very low search throughput on concurrent hardware since all searches would bunch up, waiting for reopens. With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.

Managing reopens is a little trickier with NRTManager, since you should reopen at higher frequency whenever a search is waiting for a specific generation. To address this, there's the useful NRTManagerReopenThread class; use it like this:

  double minStaleSec = 0.025;
  double maxStaleSec = 5.0;
  NRTManagerReopenThread thread = new NRTManagerReopenThread(
                                       nrtManager,
           maxStaleSec,
           minStaleSec);
  thread.start();
  ...
  thread.close();

The minStaleSec sets an upper bound on the time a user must wait before the search can run. This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec.

The maxStaleSec sets a lower bound on how frequently reopens should occur. This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches. Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time. Be sure to eventually call thread.close(), when you are done reopening (for example, on shutting down the application).

You are also free to use your own strategy for calling maybeReopen; you don't have to use NRTManagerReopenThread. Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!

41 comments:

nareshJanuary 24, 2012 at 5:12 AM
I have tried the approach1 with SearcherManager and IndexWriter. However, the returned indexSearcher doesn't return the documents that are not committed to index. Did you miss to mention any thing in this article about that?
ReplyDelete
Replies
nareshJanuary 24, 2012 at 6:06 AM
ok.. i got it. I need to call maybrReopen() once in a while
ReplyDelete
Replies
Michael McCandlessJanuary 24, 2012 at 8:24 AM
Right, you must call maybeReopen periodically... ideally from a separate thread (ie, not a searcher thread), probably the same separate thread that's doing indexing.

Or, if you use NRTManager then you can use the NRTManagerReopenThread...
ReplyDelete
Replies
AnonymousFebruary 6, 2012 at 9:34 AM
Hi, nice article
I'm using SearcherManager for search
IndexSearcher searcher = manager.acquire();

And for code below is used for updating, changes are flushed on disc, but IndexSearcher does not return changed documents, only after application restart and creation of Index

w.updateDocument(term, createDocument(q));
IndexReader newReader = IndexReader.openIfChanged(indexReader);
if (newReader != null) {
indexReader = newReader;
w.commit();
}

What I am doing wrong ?
ReplyDelete
Replies
Michael McCandlessFebruary 6, 2012 at 11:49 AM
How did you init the SearcherManager? I'd recommend passing IndexWriter; this way SearcherManager pulls near-real-time readers from it.

In your code above, you have to call IndexWriter.commit *before* calling IndexReader.openIfChanged (unless your indexReader was a near-real-time reader).
ReplyDelete
Replies
Benson MarguliesFebruary 18, 2012 at 9:50 PM
Is this comment in the SearcherManager javadoc obsolete, mayhap?

* <p>
* <b>NOTE</b>: if you have an {@link IndexWriter}, it's better to use
* {@link NRTManager} since that class pulls near-real-time readers from the
* IndexWriter.
ReplyDelete
Replies
Michael McCandlessFebruary 19, 2012 at 5:48 AM
Hi Benson,

Indeed that comment is flat out wrong!

The only reason to use NRTManager over SearcherManager is if you need certain search requests to have exact control over which indexing changes are visible...

We've removed this comment in the 3.x branch so 3.6.0 will be fixed. Thanks for raising this...
ReplyDelete
Replies
encourageJuly 6, 2012 at 7:35 AM
Hi Mike
Thanks for your posts.

Do i understand it right that in general if i use NRT-approach, i. e. passing IndexWriter to the constructor of IndexReader, i need a separate thread that periodically calls IndexWriter.commit to persist changes in case of shutdown/process kill/etc. ?
ReplyDelete
Replies
Michael McCandlessJuly 6, 2012 at 8:10 AM
Hi encourage,

No, you don't need to (and shouldn't!) call IW.commit: this is the whole power of NRT.

Commit is a very costly operation and the point of NRT is to avoid commit, ie it lets your IndexReader see the still uncommitted changes.

But what you do need to do is periodically call IndexReader.maybeReopen, so that your searches see recent changes done with the IndexWriter.
ReplyDelete
Replies
UnknownNovember 9, 2012 at 4:25 PM
Hi, Mike!

I'm trying to implement the NRT - approach using the 4.0 API.
The flow of the application is the following:

1. User is registered. => new document added in the index
2. Just after the registration user is redirected to a details page, which he may or may not complete.
3. If user completes the details => the document should be updated.

As you can see there may be a very short period of time between creating the document and searching for the document ID in order to be able to update it. This is why we decided to call maybeRefresh() without deletion on the ReferenceManager (NRT implementation) after every adDocument().

Even though I always call for a refresh, in every single test the added Document is not visible when trying to update. Is there something I'm missing?

LE: the NRTManager uses the IndexWriter.
ReplyDelete
Replies
Michael McCandlessNovember 11, 2012 at 1:52 PM
Hi Ionut,

You should use SearcherManager.maybeRefreshBlocking if you want to wait until the refresh has completed.

But it sounds like NRTManager would be a better fit here, since it allows you to only wait for the one case where this user needs to update their document, ie you can ensure the searcher you get back will reflect a specific indexing change from the past.
ReplyDelete
Replies
ApostolisDecember 21, 2012 at 2:27 PM
Hi Mike,let me first tell how useful your blog has been to me.

I have one correction:
"The minStaleSec sets an upper bound on how frequently reopens should occur.
should be:
"The minStaleSec sets an upper bound on the time someone is required to wait before his search goes through."

ReplyDelete
Replies
Michael McCandlessDecember 22, 2012 at 5:25 AM
Thanks Apostolis!

I updated the blog post with similar wording to what you suggested.
ReplyDelete
Replies
AnonymousDecember 28, 2012 at 5:39 AM
How to use NRTManager in solr4.0?
ReplyDelete
Replies
Michael McCandlessDecember 30, 2012 at 6:59 AM
Hi Anonymous,

Unfortunately, Solr doesn't use NRTManager today and I think it would not be straightforward to cutover.

Are you needing control over which searches see which index changes? It could be Solr does something like this itself already (I'm not sure) ... try emailing solr-user@lucene.apache.org?

Mike
ReplyDelete
Replies
BrijrajInSearchOfJanuary 26, 2013 at 11:27 PM
Hi Mike,

I am using the bobo browse API. It support zoie for real time indexing. Zoie internally using the lucene. I am trying to create index but came across following error. help will be greatly appreciated.

Zoie verison 3.0.0. Lucene 2.9.2 OS : windows 7 64bit

26 Jan 2013 20:50:34,352 ERROR proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader@39bc2399 proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - Problem copying segments: Cannot overwrite: C:\D-Drive\ProfilerNewJourney\releases\AbbottRegulatory\FacetIndexFinal\event_2.fdt java.io.IOException: Cannot overwrite: C:\D-Drive\releases\FacetIndex_2.fdt at org.apache.lucene.store.FSDirectory.initOutput(FSDirectory.java:362) at org.apache.lucene.store.SimpleFSDirectory.createOutput(SimpleFSDirectory.java:58) at org.apache.lucene.index.FieldsWriter.(FieldsWriter.java:61) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:334) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:153) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5045) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4630) at org.apache.lucene.index.IndexWriter.resolveExternalSegments(IndexWriter.java:3809) at org.apache.lucene.index.IndexWriter.addIndexesNoOptimize(IndexWriter.java:3718) at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:234) at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:212) at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:138) at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:177) at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:380)

Best Regards,
Brij
ReplyDelete
Replies
Michael McCandlessJanuary 27, 2013 at 5:23 AM
Hi In Search Of,

It seems likely that an IndexReader has this file open, and that causes the "Cannot overwrite" error?

However, event_2.fdt isn't a norma Lucene index filename.

I think you have to ask the bobo browse authors for help... I'm not sure what this code is doing.
ReplyDelete
Replies
AnonymousMay 15, 2013 at 2:40 AM
Mike, am a beginner in Lucene, would you suggest me to jump on Lucene4 or Lucene 3 ??

going through the API's could see lot of changes in L4 than L3..

Please suggest.

Regards,
Ronald
ReplyDelete
Replies
MVNovember 6, 2013 at 5:54 PM
Hi Michael,

We're having an issue where updates are not being picked up by the IndexReader but I'm starting to think it might be related to our particular architecture. The reading is done by a Web app but index updates are done by a completely separate process (Indexer). Once that process is done we have a Unix script that cleans up the index directory (used by the web app) and copies over the new set of files generated by the indexer.

The way we're trying to handle this on the web app side is to have a scheduled thread that wakes up every 5 mins, grabs a reference to the SearchManager (the same SearchManager used during reading) and then calls manager.maybeReopen().

When debugging we see a new IndexReader being created. We just don't see the new documents added by the Indexer.

I hope this explanation is clear enough. Any pointers will be greatly appreciated.

Thanks,
MV
ReplyDelete
Replies
JigarFebruary 3, 2014 at 5:17 AM
This comment has been removed by the author.
ReplyDelete
Replies
JigarFebruary 3, 2014 at 5:23 AM
Hello Michael,

Thanks first of all, Your blogs/posts they are very useful when i hit some problem which is internal to Lucene.

Please if you can help me understand following line which i took from NRTManager class comment

"You may want to create two NRTManagers, once
that always applies deletes on refresh and one that does
not. In this case you should use a single {@link
NRTManager.TrackingIndexWriter} instance for both."

Does this mean one with applyDeletes=true should be used by application code which is mostly creates/updates index. and other one applyDeletes=false should be used mainly to acquire() searchers, and used by search threads in application.
ReplyDelete
Replies
Gaurav GuptaMay 28, 2014 at 10:37 AM
Hi Mike,

I am implementing NRT and found that 4.4.0 release onwards the Near Real Time Manager (org.apache.lucene.search.NRTManager) has been replaced by ControlledRealTimeReopenThread.

Please advise should I use ControlledRealTimeReopenThread as described at http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage?answertab=votes#tab-top.

Thanks
Gaurav Gupta
ReplyDelete
Replies
Arun BCJune 25, 2014 at 5:53 AM
In Lucene 4.7.2, I think NRTManager is replaced with ControlledRealTimeReopenThread. As NRTManager was not available in the current release am kind of confused. Am trying out using ControlledRealTimeReopenThread but am not sure whether it will be near real-time. Can you provide some example for near real-time search using ControlledRealTimeReopenThread or it should be not used for near real-time?
ReplyDelete
Replies
Michael McCandlessJune 25, 2014 at 8:52 AM
It's definitely for use with NRT search, and then for real-time search for queries that require it; have a look at its unit tests in a Lucene source installation / svn checkout?
ReplyDelete
Replies
Arun BCJune 26, 2014 at 9:15 AM
I tried the following code,

Directory fsDirectory = FSDirectory.open(new File(location));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_47, analyzer);
indexWriterConfig.setRAMBufferSizeMB(16);
indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);

indexWriter = new IndexWriter(fsDirectory, indexWriterConfig);
trackingIndexWriter = new TrackingIndexWriter(indexWriter);

referenceManager = new SearcherManager(indexWriter, true, null);

controlledRealTimeReopenThread = new ControlledRealTimeReopenThread(trackingIndexWriter,
referenceManager, 60, 0.1);
controlledRealTimeReopenThread.setDaemon(true);
controlledRealTimeReopenThread.start();

While trying to call this during application initialization, am getting the below exception.
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@C:\Users\arun.bc\lucene-home\contractrate\write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:89) ~[lucene-core-4.7.2.jar:4.7.2 1586229 - rmuir - 2014-04-10 09:00:35]
at org.apache.lucene.index.IndexWriter.(IndexWriter.java:707) ~[lucene-core-4.7.2.jar:4.7.2 1586229 - rmuir - 2014-04-10 09:00:35]

Please suggest...
ReplyDelete
Replies
UnknownOctober 17, 2014 at 5:27 AM
Hi Mike,
I'm implementing NRTManager in c# using Lucene.Net.Contrib.Management.dll. I load all documents using an IndexWriter:
<<
Directory d = new RAMDirectory();
indexWriter = new IndexWriter(d, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_CURRENT), !IndexReader.IndexExists(d), new Lucene.Net.Index.IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));

doc = new Document();
doc.Add(new Field(
"info",
info,
Field.Store.YES,
Field.Index.ANALYZED));
// Write the Document to the catalog
indexWriter.AddDocument(doc);
>>

and then initialize the NRTManager with it.
<<
static NrtManager man = new NrtManager(indexWriter);
>>

When I need to add a new entry to the manager I do this:
<<
doc = new Document();
doc.Add(new Field(
"Info",
newInfo,
//dr["NickName"].ToString(),
Field.Store.YES,
Field.Index.ANALYZED));
// Write the Document to the catalog
man.AddDocument(doc);
>>

At search I ALWAYS do this:
<<
if (man.GetSearcherManager().MaybeReopen())
man.GetSearcherManager().Acquire().Searcher.IndexReader.Reopen();

var hits = man.GetSearcherManager().Acquire().Searcher.Search(query, 50);
>>

My problem is that I'm only able to get one new entry after de initial load. When I add a second entry, the search does not get me this one.

Can you help me with this, please?

Thanks,
Galder.
ReplyDelete
Replies
PavanDecember 12, 2018 at 1:29 AM
Hello Mike,

Need your help to address the below error while refreshing the lucene index,

java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code

We have a batch process which on a daily basis, creates the index and refreshes with old indexes and we do have an api, which will be consuming the indexes mean time.

We are getting an error from api while this refresh happens.
Can you help us to know, what is the best practice to refresh the lucene indexes without affecting any existing components which are using the lucene indexes?

Your suggestions are highly appreciated.

Regards,
Pavan
ReplyDelete
Replies

Add comment