Thursday, November 3, 2011

Near-real-time readers with Lucene's SearcherManager and NRTManager

Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.

But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.

If you have access to the IndexWriter that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice "best of both worlds" blend between the traditional immediate and eventual consistency models.

Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides extremely fast turnaround time on making index changes visible to searchers. Frequent reopens, such as every 50 milliseconds, are easily achievable on modern hardware, even under relatively high indexing rates.

Fortunately, it's trivial to use SearcherManager with NRT readers: use the constructor that takes IndexWriter instead of Directory:
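  // true: each reopened reader must reflect all deletes made so far
  boolean applyAllDeletes = true;
  // null: each search visits the segments sequentially
  ExecutorService es = null;
  SearcherManager mgr = new SearcherManager(writer, applyAllDeletes,
                                            new MySearchWarmer(), es);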
This tells SearcherManager that its source for new IndexReaders is the provided IndexWriter instance (instead of a Directory instance). After that, use the SearcherManager just as before.
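
To make the acquire/release contract concrete, here is a minimal sketch in plain Java of the reference counting that such a manager has to do. It is a toy model under stated assumptions, not Lucene's code: `Snapshot` and `ToySearcherManager` are hypothetical names, and a "reopen" here simply bumps a version number.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy stand-in for an IndexSearcher: a point-in-time view with a reference count. */
final class Snapshot {
    final int version;
    // Starts at 1: that first reference belongs to the manager itself.
    private final AtomicInteger refCount = new AtomicInteger(1);
    private volatile boolean closed = false;

    Snapshot(int version) { this.version = version; }

    /** Takes a reference unless the snapshot has already been closed. */
    boolean tryIncRef() {
        for (;;) {
            int n = refCount.get();
            if (n <= 0) return false;
            if (refCount.compareAndSet(n, n + 1)) return true;
        }
    }

    /** Drops a reference; whoever drops the last one closes the snapshot. */
    void decRef() {
        if (refCount.decrementAndGet() == 0) closed = true;
    }

    boolean isClosed() { return closed; }
}

/** Toy manager (hypothetical): hands out the current snapshot, swaps on reopen. */
final class ToySearcherManager {
    private volatile Snapshot current = new Snapshot(0);

    Snapshot acquire() {
        for (;;) {
            Snapshot s = current;
            if (s.tryIncRef()) return s; // on a lost race with reopen, retry
        }
    }

    void release(Snapshot s) { s.decRef(); }

    synchronized void reopen() {
        Snapshot old = current;
        current = new Snapshot(old.version + 1);
        old.decRef(); // drop the manager's own reference to the old view
    }

    public static void main(String[] args) {
        ToySearcherManager mgr = new ToySearcherManager();
        Snapshot s = mgr.acquire();
        try {
            mgr.reopen(); // a reopen during the search does not close s
            System.out.println("searching version " + s.version);
        } finally {
            mgr.release(s); // always release, even if the search throws
        }
    }
}
```

The point of the sketch is the try/finally pairing: a search that is in flight keeps its old view alive across a reopen, and the view is closed only once the last user releases it.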

Typically you'll set the applyAllDeletes boolean to true, meaning each reopened reader is required to apply all previous deletion operations (deleteDocuments or updateDocument/s) up until that point.

Sometimes your usage won't require deletions to be applied. For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions. If that's the case, you can pass applyAllDeletes=false instead. This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly. Alternatively, if you're using Lucene's trunk (eventually to be released as 4.0), you can use MemoryCodec on your id field to greatly reduce the primary-key lookup time.

Note that some or even all of the previous deletes may still be applied even if you pass false. Also, the pending deletes are never lost if you pass false: they remain buffered and will still eventually be applied.

If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two SearcherManagers, one applying deletes and one not.

If you pass a non-null ExecutorService, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request. Most applications do not require this, because the concurrency across multiple searches is sufficient. It's also not clear that this is effective in general as it adds per-segment overhead, and the available concurrency is a function of your index structure. Perversely, a fully optimized index will have no concurrency! Most applications should pass null.
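
As a rough illustration of why the available concurrency depends on index structure, here is a small plain-Java sketch that "searches" each segment as a separate task. It is only a model: the segments are lists of strings, `SegmentSearchSketch` and `countHits` are hypothetical names, and none of this is Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SegmentSearchSketch {
    /** Counts documents containing term; one task per segment when es is non-null. */
    static int countHits(List<List<String>> segments, String term, ExecutorService es)
            throws Exception {
        if (es == null) {
            // Sequential path: the calling thread visits every segment.
            int total = 0;
            for (List<String> seg : segments) total += countInSegment(seg, term);
            return total;
        }
        // Concurrent path: the parallelism is capped by the number of segments.
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (List<String> seg : segments) {
            tasks.add(() -> countInSegment(seg, term));
        }
        int total = 0;
        for (Future<Integer> f : es.invokeAll(tasks)) total += f.get();
        return total;
    }

    static int countInSegment(List<String> segment, String term) {
        int n = 0;
        for (String doc : segment) if (doc.contains(term)) n++;
        return n;
    }

    public static void main(String[] args) throws Exception {
        List<List<String>> segments = Arrays.asList(
            Arrays.asList("lucene nrt reader", "solr"),
            Arrays.asList("lucene commit"));
        ExecutorService es = Executors.newFixedThreadPool(2);
        try {
            System.out.println(countHits(segments, "lucene", null));
            System.out.println(countHits(segments, "lucene", es));
        } finally {
            es.shutdown();
        }
    }
}
```

With a single segment the task list has exactly one entry, so the executor adds overhead without adding any parallelism, which is the "fully optimized index" case described above.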



NRTManager

What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches? Use NRTManager!

NRTManager holds onto the IndexWriter instance you provide and then exposes the same APIs for making index changes (addDocument/s, updateDocument/s, deleteDocuments). These methods forward to the underlying IndexWriter, but then return a generation token (a Java long) which you can hold onto after making any given change. The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.

Then, when a given search request requires certain changes to be visible, pass that generation back to NRTManager to obtain a searcher that's guaranteed to reflect all changes for that generation.

Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum. Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using NRTManager.addDocument, saving the returned generation. If she adds multiple posts, just keep the last generation.

Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made. Of course, if your reopen time is fast enough (say once per second), unless Alice types very quickly, any search she runs will already reflect her posts.

But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call NRTManager.waitForGeneration to obtain the SearcherManager to use for searching. If the latest searcher already covers the requested generation, the method returns immediately. Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.

If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it's far less important that Alice's changes become immediately visible to Bob. There's (usually!) no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.

Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index. That immediate search must first wait for the returned generation to become available.

The power of NRTManager is you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model. NRTManager hides all the tricky details of tracking generations.
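
The generation idea can be sketched in a few lines of plain Java. This is a deliberately simplified model, not NRTManager's code: `ToyGenerationTracker` and its methods are hypothetical, a "reopen" just marks everything indexed so far as visible, and the real class additionally asks the reopen thread to hurry when a search is waiting.

```java
/** Toy model: changes get increasing generations; searches may wait for one. */
final class ToyGenerationTracker {
    private long indexedGen = 0; // generation of the most recent change
    private long visibleGen = 0; // newest generation the current searcher covers

    /** Records a change and returns its generation token. */
    synchronized long addDocument(String doc) {
        return ++indexedGen;
    }

    /** Plays the role of a reopen: everything indexed so far becomes visible. */
    synchronized void reopen() {
        visibleGen = indexedGen;
        notifyAll(); // wake any search waiting on a generation
    }

    /** Returns at once if targetGen is already visible, otherwise blocks. */
    synchronized long waitForGeneration(long targetGen) throws InterruptedException {
        while (visibleGen < targetGen) wait();
        return visibleGen;
    }

    /** What a search that does not care about recent changes would use. */
    synchronized long currentVisibleGen() { return visibleGen; }

    public static void main(String[] args) throws InterruptedException {
        ToyGenerationTracker tracker = new ToyGenerationTracker();
        long aliceGen = tracker.addDocument("Alice's post");

        // Bob: take whatever is visible right now, without waiting.
        System.out.println("Bob sees generation " + tracker.currentVisibleGen());

        // A background thread stands in for the periodic reopen.
        Thread reopener = new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException e) { return; }
            tracker.reopen();
        });
        reopener.start();

        // Alice: block until her own post is covered.
        System.out.println("Alice sees generation " + tracker.waitForGeneration(aliceGen));
        reopener.join();
    }
}
```

Only the caller that holds a generation token pays the cost of waiting; everyone else reads the currently visible state.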

But: don't abuse this! You may be tempted to always wait for the last generation you indexed for all searches, but this would result in very low search throughput on concurrent hardware since all searches would bunch up, waiting for reopens. With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.

Managing reopens is a little trickier with NRTManager, since you should reopen at higher frequency whenever a search is waiting for a specific generation. To address this, there's the useful NRTManagerReopenThread class; use it like this:
  double minStaleSec = 0.025;
  double maxStaleSec = 5.0;
  NRTManagerReopenThread thread = new NRTManagerReopenThread(
                                       nrtManager,
                                       maxStaleSec,
                                       minStaleSec);
  thread.start();
  ...
  thread.close();
The minStaleSec sets an upper bound on the time a user must wait before the search can run. This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec.

The maxStaleSec sets a lower bound on how frequently reopens should occur. This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches. Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time. Be sure to eventually call thread.close(), when you are done reopening (for example, on shutting down the application).
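
One way to picture how the two parameters interact is as a choice of sleep interval before the next reopen. The sketch below is an assumption-laden model of that idea in plain Java, not the actual NRTManagerReopenThread logic; `ReopenPolicySketch` and `sleepMillis` are hypothetical names.

```java
final class ReopenPolicySketch {
    /**
     * Returns how many milliseconds to sleep before the next reopen.
     * If a search is waiting on a generation, aim for minStaleSec after the
     * previous reopen; otherwise aim for maxStaleSec.
     */
    static long sleepMillis(boolean searchWaiting, double minStaleSec,
                            double maxStaleSec, long lastReopenMs, long nowMs) {
        double targetSec = searchWaiting ? minStaleSec : maxStaleSec;
        long nextReopenMs = lastReopenMs + Math.round(targetSec * 1000.0);
        return Math.max(0L, nextReopenMs - nowMs); // never a negative sleep
    }

    public static void main(String[] args) {
        long lastReopenMs = 1000;
        // Nobody waiting (Bob): the next reopen can be up to 5 seconds out.
        System.out.println(sleepMillis(false, 0.025, 5.0, lastReopenMs, 2000));
        // Somebody waiting (Alice): reopen almost immediately.
        System.out.println(sleepMillis(true, 0.025, 5.0, lastReopenMs, 2000));
    }
}
```

In this model a waiting search never delays the reopen by more than minStaleSec after the previous one, which is why the parameters behave as targets rather than guarantees: the reopen itself still takes time.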

You are also free to use your own strategy for calling maybeReopen; you don't have to use NRTManagerReopenThread. Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!

41 comments:

  1. I have tried approach 1, with SearcherManager and IndexWriter. However, the returned IndexSearcher doesn't return documents that have not been committed to the index. Did you forget to mention anything about that in this article?

  2. OK, I got it. I need to call maybeReopen() once in a while.

  3. Right, you must call maybeReopen periodically... ideally from a separate thread (ie, not a searcher thread), probably the same separate thread that's doing indexing.

    Or, if you use NRTManager then you can use the NRTManagerReopenThread...

  4. Hi, nice article
    I'm using SearcherManager for search
    IndexSearcher searcher = manager.acquire();

    The code below is used for updating. Changes are flushed to disk, but the IndexSearcher does not return the changed documents until after the application restarts and the index is created.

    w.updateDocument(term, createDocument(q));
    IndexReader newReader = IndexReader.openIfChanged(indexReader);
    if (newReader != null) {
    indexReader = newReader;
    w.commit();
    }

    What am I doing wrong?

  5. How did you init the SearcherManager? I'd recommend passing IndexWriter; this way SearcherManager pulls near-real-time readers from it.

    In your code above, you have to call IndexWriter.commit *before* calling IndexReader.openIfChanged (unless your indexReader was a near-real-time reader).

  6. Is this comment in the SearcherManager javadoc obsolete, mayhap?

    * <p>
    * <b>NOTE</b>: if you have an {@link IndexWriter}, it's better to use
    * {@link NRTManager} since that class pulls near-real-time readers from the
    * IndexWriter.

  7. Hi Benson,

    Indeed that comment is flat out wrong!

    The only reason to use NRTManager over SearcherManager is if you need certain search requests to have exact control over which indexing changes are visible...

    We've removed this comment in the 3.x branch so 3.6.0 will be fixed. Thanks for raising this...

  8. Hi Mike
    Thanks for your posts.

    Do i understand it right that in general if i use NRT-approach, i. e. passing IndexWriter to the constructor of IndexReader, i need a separate thread that periodically calls IndexWriter.commit to persist changes in case of shutdown/process kill/etc. ?

  9. Hi encourage,

    No, you don't need to (and shouldn't!) call IW.commit: this is the whole power of NRT.

    Commit is a very costly operation and the point of NRT is to avoid commit, ie it lets your IndexReader see the still uncommitted changes.

    But what you do need to do is periodically call SearcherManager.maybeReopen, so that your searches see recent changes done with the IndexWriter.

  10. Hi, Mike!

    I'm trying to implement the NRT - approach using the 4.0 API.
    The flow of the application is the following:

    1. User is registered. => new document added in the index
    2. Just after the registration user is redirected to a details page, which he may or may not complete.
    3. If user completes the details => the document should be updated.

    As you can see, there may be a very short period of time between creating the document and searching for the document ID in order to be able to update it. This is why we decided to call maybeRefresh() without deletion on the ReferenceManager (NRT implementation) after every addDocument().

    Even though I always call for a refresh, in every single test the added Document is not visible when trying to update. Is there something I'm missing?

    LE: the NRTManager uses the IndexWriter.

  11. Hi Ionut,

    You should use SearcherManager.maybeRefreshBlocking if you want to wait until the refresh has completed.

    But it sounds like NRTManager would be a better fit here, since it allows you to only wait for the one case where this user needs to update their document, ie you can ensure the searcher you get back will reflect a specific indexing change from the past.

  12. Hi Mike, let me first say how useful your blog has been to me.

    I have one correction:
    "The minStaleSec sets an upper bound on how frequently reopens should occur."
    should be:
    "The minStaleSec sets an upper bound on the time someone is required to wait before his search goes through."

  13. Thanks Apostolis!

    I updated the blog post with similar wording to what you suggested.

  14. How to use NRTManager in solr4.0?

  15. Hi Anonymous,

    Unfortunately, Solr doesn't use NRTManager today and I think it would not be straightforward to cutover.

    Are you needing control over which searches see which index changes? It could be Solr does something like this itself already (I'm not sure) ... try emailing solr-user@lucene.apache.org?

    Mike

  16. Hi Mike,

    I am using the bobo browse API. It supports Zoie for real-time indexing, and Zoie uses Lucene internally. I am trying to create an index but came across the following error. Help will be greatly appreciated.

    Zoie version 3.0.0. Lucene 2.9.2. OS: Windows 7 64-bit

    26 Jan 2013 20:50:34,352 ERROR proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader@39bc2399 proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - Problem copying segments: Cannot overwrite: C:\D-Drive\ProfilerNewJourney\releases\AbbottRegulatory\FacetIndexFinal\event_2.fdt
    java.io.IOException: Cannot overwrite: C:\D-Drive\releases\FacetIndex_2.fdt
        at org.apache.lucene.store.FSDirectory.initOutput(FSDirectory.java:362)
        at org.apache.lucene.store.SimpleFSDirectory.createOutput(SimpleFSDirectory.java:58)
        at org.apache.lucene.index.FieldsWriter.<init>(FieldsWriter.java:61)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:334)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:153)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5045)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4630)
        at org.apache.lucene.index.IndexWriter.resolveExternalSegments(IndexWriter.java:3809)
        at org.apache.lucene.index.IndexWriter.addIndexesNoOptimize(IndexWriter.java:3718)
        at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:234)
        at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:212)
        at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:138)
        at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:177)
        at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:380)


    Best Regards,
    Brij

  17. Hi In Search Of,

    It seems likely that an IndexReader has this file open, and that causes the "Cannot overwrite" error?

    However, event_2.fdt isn't a normal Lucene index filename.

    I think you have to ask the bobo browse authors for help... I'm not sure what this code is doing.

    Replies
    1. Thanks Mike. Sorry I edited the path of index file so directory name is merged there.


      Sure. I will contact the bobo browse author. I have created jira ticket in zoie project

  18. Mike, I'm a beginner in Lucene; would you suggest I start with Lucene 4 or Lucene 3?

    Going through the APIs, I can see a lot of changes in Lucene 4 compared to Lucene 3.

    Please suggest.

    Regards,
    Ronald

    Replies
    1. Hi Anonymous,

      I would definitely start with Lucene 4 at this point.

  19. Hi Michael,

    We're having an issue where updates are not being picked up by the IndexReader but I'm starting to think it might be related to our particular architecture. The reading is done by a Web app but index updates are done by a completely separate process (Indexer). Once that process is done we have a Unix script that cleans up the index directory (used by the web app) and copies over the new set of files generated by the indexer.

    The way we're trying to handle this on the web app side is to have a scheduled thread that wakes up every 5 minutes, grabs a reference to the SearcherManager (the same SearcherManager used during reading) and then calls manager.maybeReopen().

    When debugging we see a new IndexReader being created. We just don't see the new documents added by the Indexer.

    I hope this explanation is clear enough. Any pointers will be greatly appreciated.

    Thanks,
    MV

    Replies
    1. Hi MV, it looks like you also asked on the Lucene user's list... so I replied there.

    2. Well, actually that was somebody else from my team. What I put in my original comment can shed light on the big picture of the app we built. As I mentioned before, reading and updating the indexes are done by two different apps running on different servers. This seems to be an atypical use case, as everywhere in Lucene's docs and forums the 'normal' usage seems to be the same code handling both operations. As my coworker explains, the issue seems to be that the new index has the same filenames as the old one. This seems to cause the IndexReaders to point to the old segments.

    3. Hmm, this (that you said above) is particularly troubling: "When debugging we see a new IndexReader being created. We just don't see the new documents added by the Indexer.".

      If the index was newly built and copied over, and then the old IndexReader is reopened, and indeed a new instance was opened, yet you are still missing documents ... I think it must be that the documents you expected were not in fact indexed? Or, are you certain a new IndexReader was actually opened?

    4. Hi Michael,

      We were finally able to fix this issue. This is our understanding of the problem and how we fixed it:
      - At the code level we were deleting all (previous) documents and adding a new set.
      - But at the OS level we were deleting all the files beforehand, thinking this was a safe approach.

      It turns out that because of this the new index files ended up with exactly the same names as the old ones. When we copied over the files and the SearcherManager loaded them, we saw that although a new IndexReader instance was being created, the underlying readers were still pointing to the 'old' index.

      We fixed this by no longer deleting files ourselves and instead letting Lucene take care of the whole thing. After that we started to see new files being created, and while the old files were still there, the SearcherManager was now able to fetch the new set of documents.

      Regards,
      MV

    5. OK, that seems like a good solution (IndexWriter.deleteAll); this way the file names will never be reused (Lucene is write-once).

  20. This comment has been removed by the author.

  21. Hello Michael,

    First of all, thanks. Your blog posts are very useful when I hit a problem that is internal to Lucene.

    Could you please help me understand the following lines, which I took from the NRTManager class comment:

    "You may want to create two NRTManagers, once
    that always applies deletes on refresh and one that does
    not. In this case you should use a single {@link
    NRTManager.TrackingIndexWriter} instance for both."

    Does this mean the one with applyDeletes=true should be used by application code that mostly creates/updates the index, and the other one, with applyDeletes=false, should be used mainly to acquire() searchers by the search threads in the application?

    Replies
    1. Hi Jigar,

      This is useful if you have some requests that must show all deletions (such as incoming user searches) and other requests where it doesn't matter (e.g. if you have some automation scripts that run searches looking for specific SKUs or something)... in that case you can simply make two NRTManager instances. This is a fairly esoteric use case, though, and I would start by just making a single instance that always applies deletes and sharing that across both use cases, until/unless you hit performance issues.

    2. This comment has been removed by the author.

  22. Hi Mike,

    I am implementing NRT and found that from the 4.4.0 release onwards the Near Real Time Manager (org.apache.lucene.search.NRTManager) has been replaced by ControlledRealTimeReopenThread.

    Please advise whether I should use ControlledRealTimeReopenThread as described at http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage?answertab=votes#tab-top.

    Thanks
    Gaurav Gupta

  23. In Lucene 4.7.2, I think NRTManager is replaced with ControlledRealTimeReopenThread. As NRTManager is not available in the current release, I'm somewhat confused. I'm trying out ControlledRealTimeReopenThread but I'm not sure whether it will be near-real-time. Can you provide an example of near-real-time search using ControlledRealTimeReopenThread, or should it not be used for near-real-time search?

  24. It's definitely for use with NRT search, and then for real-time search for queries that require it; have a look at its unit tests in a Lucene source installation / svn checkout?

  25. I tried the following code,

    Directory fsDirectory = FSDirectory.open(new File(location));
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_47, analyzer);
    indexWriterConfig.setRAMBufferSizeMB(16);
    indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);

    indexWriter = new IndexWriter(fsDirectory, indexWriterConfig);
    trackingIndexWriter = new TrackingIndexWriter(indexWriter);

    referenceManager = new SearcherManager(indexWriter, true, null);

    controlledRealTimeReopenThread = new ControlledRealTimeReopenThread(trackingIndexWriter,
    referenceManager, 60, 0.1);
    controlledRealTimeReopenThread.setDaemon(true);
    controlledRealTimeReopenThread.start();

    When calling this during application initialization, I get the exception below.
    org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@C:\Users\arun.bc\lucene-home\contractrate\write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:89) ~[lucene-core-4.7.2.jar:4.7.2 1586229 - rmuir - 2014-04-10 09:00:35]
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:707) ~[lucene-core-4.7.2.jar:4.7.2 1586229 - rmuir - 2014-04-10 09:00:35]

    Please suggest...

    Replies
    1. Spring container was initializing the bean twice. I fixed the above issue. Could you please correct me if above implementation is correct for NRT using lucene 4.7.2?

    2. That code looks correct!

      Then, for each query, you determine whether it needs the "current" reader or it must wait for a specific indexing generation (because you want to ensure a certain indexing change is visible), when acquiring the searcher.

  26. Hi Mike,
    I'm implementing NRTManager in c# using Lucene.Net.Contrib.Management.dll. I load all documents using an IndexWriter:
    <<
    Directory d = new RAMDirectory();
    indexWriter = new IndexWriter(d, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_CURRENT), !IndexReader.IndexExists(d), new Lucene.Net.Index.IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));

    doc = new Document();
    doc.Add(new Field(
    "info",
    info,
    Field.Store.YES,
    Field.Index.ANALYZED));
    // Write the Document to the catalog
    indexWriter.AddDocument(doc);
    >>

    and then initialize the NRTManager with it.
    <<
    static NrtManager man = new NrtManager(indexWriter);
    >>

    When I need to add a new entry to the manager I do this:
    <<
    doc = new Document();
    doc.Add(new Field(
    "Info",
    newInfo,
    //dr["NickName"].ToString(),
    Field.Store.YES,
    Field.Index.ANALYZED));
    // Write the Document to the catalog
    man.AddDocument(doc);
    >>

    At search I ALWAYS do this:
    <<
    if (man.GetSearcherManager().MaybeReopen())
    man.GetSearcherManager().Acquire().Searcher.IndexReader.Reopen();

    var hits = man.GetSearcherManager().Acquire().Searcher.Search(query, 50);
    >>

    My problem is that I'm only able to get one new entry after the initial load. When I add a second entry, the search does not return it.

    Can you help me with this, please?

    Thanks,
    Galder.

    Replies
    1. You should not need to call Searcher.IndexReader.Reopen like that, assuming the C# port is like Lucene's. A single call to .maybeRefresh will open a new NRT reader, if there are any changes.

      Also, NRTManager (renamed / factored out a while back to ControlledRealTimeReopenThread in Lucene) is only needed when you have some threads that want a "real-time" reader and other threads that are OK with the current near-real-time reader.

      Maybe you should simplify your test to just use an "ordinary" SearcherManager and see if the problem still happens?

      If so, there must be a bug somewhere in tracking of changes in the C# IndexWriter...

  27. Hello Mike,

    Need your help to address the below error while refreshing the lucene index,

    java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code

    We have a batch process which, on a daily basis, creates the index and replaces the old indexes, and we have an API which consumes the indexes in the meantime.

    We are getting an error from the API while this refresh happens.
    Can you help us understand the best practice for refreshing the Lucene indexes without affecting existing components that are using them?

    Your suggestions are highly appreciated.

    Regards,
    Pavan

    Replies
    1. Hi Pavan,

      You should use SearcherManager -- it makes it really simple to refresh the searcher while queries are still in flight across multiple threads.

      It's best to ask on the Lucene user's list -- java-user@lucene.apache.org

    2. Thanks a lot, Mike. I am using searcher manager to refresh the documents and getting the error. I will reach out to lucene user's list.

      Regards,
      Pavan
