My previous post described the new SearcherManager class, coming in the next (3.5.0) Lucene release, which periodically reopens your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.
But that example used a non-near-real-time (NRT) IndexReader, which has a relatively high turnaround time for index changes to become visible, since you must first call IndexWriter.commit.
If you have access to the IndexWriter that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions.
This controlled consistency model that Lucene exposes is a nice "best of both worlds" blend of the traditional immediate and eventual consistency models. Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides extremely fast turnaround time on making index changes visible to searchers. Frequent reopens, such as every 50 milliseconds, are easily achievable on modern hardware, even under relatively high indexing rates.
Fortunately, it's trivial to use SearcherManager with NRT readers: use the constructor that takes IndexWriter instead of Directory:

  boolean applyAllDeletes = true;
  ExecutorService es = null;
  SearcherManager mgr = new SearcherManager(writer, applyAllDeletes,
                                            new MySearchWarmer(), es);

This tells SearcherManager that its source for new IndexReaders is the provided IndexWriter instance (instead of a Directory instance). After that, use the SearcherManager just as before.
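A typical search request then follows the usual acquire/release pattern. Here's a minimal sketch, assuming `mgr` is the SearcherManager from above and `query` is an already-built Query:

```java
// Acquire the current IndexSearcher; SearcherManager handles the
// reference counting so the underlying reader stays open while in use.
IndexSearcher searcher = mgr.acquire();
try {
  TopDocs hits = searcher.search(query, 10);
  // ... iterate hits.scoreDocs, load stored fields, render results ...
} finally {
  // Always release, even on exception, so old readers can be closed.
  mgr.release(searcher);
  searcher = null;  // defensive: never use a searcher after release
}
```

Separately, something (for example the indexing thread, or a dedicated thread) must periodically call mgr.maybeReopen() so that new searches see recent changes.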
Typically you'll set the applyAllDeletes boolean to true, meaning each reopened reader is required to apply all previous deletion operations (deleteDocuments or updateDocument/s) up until that point.
Sometimes your usage won't require deletions to be applied. For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions. If that's the case, you can pass applyAllDeletes=false instead. This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly. However, if you're using Lucene's trunk (to be eventually released as 4.0), another option is to use MemoryCodec on your id field to greatly reduce the primary-key lookup time.
Note that some or even all of the previous deletes may still be applied even if you pass false. Also, the pending deletes are never lost if you pass false: they remain buffered and will still eventually be applied. If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two SearcherManagers, one applying deletes and one not.
If you pass a non-null ExecutorService, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request. Most applications do not require this, because the concurrency across multiple searches is sufficient. It's also not clear that this is effective in general, as it adds per-segment overhead, and the available concurrency is a function of your index structure. Perversely, a fully optimized index will have no concurrency! Most applications should pass null.
NRTManager
What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches? Use NRTManager!
NRTManager holds onto the IndexWriter instance you provide and then exposes the same APIs for making index changes (addDocument/s, updateDocument/s, deleteDocuments). These methods forward to the underlying IndexWriter, but then return a generation token (a Java long) which you can hold onto after making any given change. The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.
Then, when a given search request requires certain changes to be visible, pass that generation back to NRTManager to obtain a searcher that's guaranteed to reflect all changes for that generation.
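On the indexing side that might look like this (a sketch against the 3.5-era NRTManager API; the `nrtManager` variable, the documents, and the "id" field are assumptions for illustration):

```java
// Each change goes through NRTManager, which forwards to the
// underlying IndexWriter and returns a generation token (a long).
long gen;
gen = nrtManager.addDocument(doc1);
gen = nrtManager.updateDocument(new Term("id", "post-17"), doc2);
gen = nrtManager.deleteDocuments(new Term("id", "post-12"));
// Generations only increase, so keeping just the last token is enough
// to later wait for all three changes to become visible.
```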
Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum. Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using NRTManager.addDocument, saving the returned generation. If she adds multiple posts, just keep the last generation.
Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made. Of course, if your reopen time is fast enough (say once per second), unless Alice types very quickly, any search she runs will already reflect her posts.
But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call NRTManager.waitForGeneration to obtain the SearcherManager to use for searching. If the latest searcher already covers the requested generation, the method returns immediately. Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.
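In code, Alice's search might look like this (a sketch; `aliceGen` is the generation saved from her last post, and the second argument asks for deletes to be applied; the exact signature may vary by release):

```java
// Block (usually briefly) until a searcher covering aliceGen is
// available, then search as usual via the returned SearcherManager.
SearcherManager mgr = nrtManager.waitForGeneration(aliceGen, true);
IndexSearcher searcher = mgr.acquire();
try {
  TopDocs hits = searcher.search(aliceQuery, 10);
  // ... render Alice's results, guaranteed to include her posts ...
} finally {
  mgr.release(searcher);
}
```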
If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it's far less important when Alice's changes become immediately visible to Bob. There's (usually!) no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.
Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index. That immediate search must first wait for the returned generation to become available.
The power of NRTManager is that you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model. NRTManager hides all the tricky details of tracking generations.
But: don't abuse this! You may be tempted to always wait, for all searches, for the last generation you indexed, but this would result in very low search throughput on concurrent hardware, since all searches would bunch up, waiting for reopens. With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.
Managing reopens is a little trickier with NRTManager, since you should reopen at higher frequency whenever a search is waiting for a specific generation. To address this, there's the useful NRTManagerReopenThread class; use it like this:

  double minStaleSec = 0.025;
  double maxStaleSec = 5.0;
  NRTManagerReopenThread thread = new NRTManagerReopenThread(
      nrtManager, maxStaleSec, minStaleSec);
  thread.start();
  ...
  thread.close();

The minStaleSec sets an upper bound on the time a user must wait before the search can run. This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec.
The maxStaleSec sets a lower bound on how frequently reopens should occur. This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches. Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time. Be sure to eventually call thread.close(), when you are done reopening (for example, on shutting down the application).
You are also free to use your own strategy for calling maybeReopen; you don't have to use NRTManagerReopenThread. Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!
I have tried approach 1 with SearcherManager and IndexWriter. However, the returned IndexSearcher doesn't return the documents that are not committed to the index. Did you forget to mention anything in this article about that?
ok.. I got it. I need to call maybeReopen() once in a while
Right, you must call maybeReopen periodically... ideally from a separate thread (ie, not a searcher thread), probably the same separate thread that's doing indexing.
Or, if you use NRTManager then you can use the NRTManagerReopenThread...
Hi, nice article
I'm using SearcherManager for search
IndexSearcher searcher = manager.acquire();
And the code below is used for updating; changes are flushed to disk, but the IndexSearcher does not return the changed documents, only after application restart and re-creation of the index
w.updateDocument(term, createDocument(q));
IndexReader newReader = IndexReader.openIfChanged(indexReader);
if (newReader != null) {
indexReader = newReader;
w.commit();
}
What I am doing wrong ?
How did you init the SearcherManager? I'd recommend passing IndexWriter; this way SearcherManager pulls near-real-time readers from it.
In your code above, you have to call IndexWriter.commit *before* calling IndexReader.openIfChanged (unless your indexReader was a near-real-time reader).
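In other words, the update path could be reordered like this (a sketch for a non-NRT reader, reusing the variable names from the snippet above):

```java
w.updateDocument(term, createDocument(q));
w.commit();  // must happen first: a non-NRT reader only sees committed changes
IndexReader newReader = IndexReader.openIfChanged(indexReader);
if (newReader != null) {
  indexReader.close();  // release the old reader
  indexReader = newReader;
}
```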
Is this comment in the SearcherManager javadoc obsolete, mayhap?
* <p>
* <b>NOTE</b>: if you have an {@link IndexWriter}, it's better to use
* {@link NRTManager} since that class pulls near-real-time readers from the
* IndexWriter.
Hi Benson,
Indeed that comment is flat out wrong!
The only reason to use NRTManager over SearcherManager is if you need certain search requests to have exact control over which indexing changes are visible...
We've removed this comment in the 3.x branch so 3.6.0 will be fixed. Thanks for raising this...
Hi Mike
Thanks for your posts.
Do I understand it right that, in general, if I use the NRT approach, i.e. passing IndexWriter to the constructor of IndexReader, I need a separate thread that periodically calls IndexWriter.commit to persist changes in case of shutdown/process kill/etc.?
Hi encourage,
No, you don't need to (and shouldn't!) call IW.commit: this is the whole power of NRT.
Commit is a very costly operation and the point of NRT is to avoid commit, ie it lets your IndexReader see the still uncommitted changes.
But what you do need to do is periodically call IndexReader.maybeReopen, so that your searches see recent changes done with the IndexWriter.
Hi, Mike!
I'm trying to implement the NRT approach using the 4.0 API.
The flow of the application is the following:
1. User is registered. => new document added in the index
2. Just after the registration user is redirected to a details page, which he may or may not complete.
3. If user completes the details => the document should be updated.
As you can see, there may be a very short period of time between creating the document and searching for the document ID in order to be able to update it. This is why we decided to call maybeRefresh() without deletion on the ReferenceManager (NRT implementation) after every addDocument().
Even though I always call for a refresh, in every single test the added Document is not visible when trying to update. Is there something I'm missing?
LE: the NRTManager uses the IndexWriter.
Hi Ionut,
You should use SearcherManager.maybeRefreshBlocking if you want to wait until the refresh has completed.
But it sounds like NRTManager would be a better fit here, since it allows you to only wait for the one case where this user needs to update their document, ie you can ensure the searcher you get back will reflect a specific indexing change from the past.
Hi Mike,let me first tell how useful your blog has been to me.
I have one correction:
"The minStaleSec sets an upper bound on how frequently reopens should occur.
should be:
"The minStaleSec sets an upper bound on the time someone is required to wait before his search goes through."
Thanks Apostolis!
I updated the blog post with similar wording to what you suggested.
How to use NRTManager in Solr 4.0?
ReplyDeleteHi Anonymous,
Unfortunately, Solr doesn't use NRTManager today and I think it would not be straightforward to cutover.
Are you needing control over which searches see which index changes? It could be Solr does something like this itself already (I'm not sure) ... try emailing solr-user@lucene.apache.org?
Mike
Hi Mike,
I am using the Bobo Browse API. It supports Zoie for real-time indexing, and Zoie internally uses Lucene. I am trying to create an index but came across the following error. Help will be greatly appreciated.
Zoie version 3.0.0. Lucene 2.9.2. OS: Windows 7 64-bit.
26 Jan 2013 20:50:34,352 ERROR proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader@39bc2399 proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - Problem copying segments: Cannot overwrite: C:\D-Drive\ProfilerNewJourney\releases\AbbottRegulatory\FacetIndexFinal\event_2.fdt
java.io.IOException: Cannot overwrite: C:\D-Drive\releases\FacetIndex_2.fdt
    at org.apache.lucene.store.FSDirectory.initOutput(FSDirectory.java:362)
    at org.apache.lucene.store.SimpleFSDirectory.createOutput(SimpleFSDirectory.java:58)
    at org.apache.lucene.index.FieldsWriter.<init>(FieldsWriter.java:61)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:334)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:153)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5045)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4630)
    at org.apache.lucene.index.IndexWriter.resolveExternalSegments(IndexWriter.java:3809)
    at org.apache.lucene.index.IndexWriter.addIndexesNoOptimize(IndexWriter.java:3718)
    at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:234)
    at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:212)
    at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:138)
    at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:177)
    at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:380)
Best Regards,
Brij
Hi In Search Of,
It seems likely that an IndexReader has this file open, and that causes the "Cannot overwrite" error?
However, event_2.fdt isn't a normal Lucene index filename.
I think you have to ask the bobo browse authors for help... I'm not sure what this code is doing.
Thanks Mike. Sorry, I edited the path of the index file, so the directory name got merged in there.
Sure. I will contact the Bobo Browse author. I have created a JIRA ticket in the Zoie project.
Mike, I am a beginner in Lucene. Would you suggest I jump to Lucene 4 or Lucene 3?
Going through the APIs, I could see a lot of changes in L4 compared to L3.
Please suggest.
Regards,
Ronald
Hi Anonymous,
I would definitely start with Lucene 4 at this point.
Hi Michael,
We're having an issue where updates are not being picked up by the IndexReader but I'm starting to think it might be related to our particular architecture. The reading is done by a Web app but index updates are done by a completely separate process (Indexer). Once that process is done we have a Unix script that cleans up the index directory (used by the web app) and copies over the new set of files generated by the indexer.
The way we're trying to handle this on the web app side is to have a scheduled thread that wakes up every 5 mins, grabs a reference to the SearcherManager (the same SearcherManager used during reading) and then calls manager.maybeReopen().
When debugging we see a new IndexReader being created. We just don't see the new documents added by the Indexer.
I hope this explanation is clear enough. Any pointers will be greatly appreciated.
Thanks,
MV
Hi MV, it looks like you also asked on the Lucene user's list... so I replied there.
Well, actually that was somebody else from my team. Whatever I put in my original comment can shed light on the big picture of the app we built. Like I mentioned before, reading and updating the indexes are processes done by 2 different apps running on different servers. This seems to be an atypical use case, as everywhere in Lucene's docs and forums the 'normal' usage seems to be the same code handling both operations. Like my coworker explains, the issue seems to be with the new index having the same filenames as the old one. This seems to cause the IndexReaders to point to the old segments.
Hmm, this (that you said above) is particularly troubling: "When debugging we see a new IndexReader being created. We just don't see the new documents added by the Indexer.".
If the index was newly built and copied over, and then the old IndexReader is opened, and indeed a new instance was opened, yet you are still missing documents ... I think it must be that the documents you expected were not in fact indexed? Or, are you certain a new IndexReader was actually opened?
Hi Michael,
We were finally able to fix this issue. This is our understanding of the problem and how we fixed it:
- At the code level we were deleting all (previous) documents and adding a new set.
- But, at the OS level, we were deleting all files first, thinking this was actually a safe approach.
It turns out that because of this the new index files ended up with the exact same names as the old ones. When we copied over the files and the SearcherManager loaded them up, we saw that although a new IndexReader instance was being created, the underlying readers were still pointing to the 'old' index.
The way we fixed this is that we stopped deleting files and instead let Lucene take care of the whole thing. After that we started to see new files being created, and while the old files were still there, the SearcherManager was now able to fetch the new set of documents.
Regards,
MV
OK, that seems like a good solution (IndexWriter.deleteAll); this way the file names will never be reused (Lucene is write-once).
Hello Michael,
ReplyDeleteThanks first of all, Your blogs/posts they are very useful when i hit some problem which is internal to Lucene.
Please can you help me understand the following lines, which I took from the NRTManager class comment:
"You may want to create two NRTManagers, once
that always applies deletes on refresh and one that does
not. In this case you should use a single {@link
NRTManager.TrackingIndexWriter} instance for both."
Does this mean the one with applyDeletes=true should be used by application code that mostly creates/updates the index, and the other one with applyDeletes=false should be used mainly to acquire() searchers for the search threads in the application?
Hi Jigar,
This is useful if you have some requests that must show all deletions (such as incoming user searches) and other requests where it doesn't matter (e.g. if you have some automation scripts that run searches looking for specific SKUs or something)... in that case you can simply make two NRTManager instances. This is a fairly esoteric use case, though, and I would start by just making a single instance that always applies deletes and sharing that across both use cases, until/unless you hit performance issues.
Hi Mike,
I am implementing NRT and found that from the 4.4.0 release onwards the Near Real Time Manager (org.apache.lucene.search.NRTManager) has been replaced by ControlledRealTimeReopenThread.
Please advise should I use ControlledRealTimeReopenThread as described at http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage?answertab=votes#tab-top.
Thanks
Gaurav Gupta
In Lucene 4.7.2, I think NRTManager is replaced with ControlledRealTimeReopenThread. As NRTManager is not available in the current release, I am kind of confused. I am trying out ControlledRealTimeReopenThread but am not sure whether it will be near real-time. Can you provide some example of near-real-time search using ControlledRealTimeReopenThread, or should it not be used for near real-time?
It's definitely for use with NRT search, and then for real-time search for queries that require it; have a look at its unit tests in a Lucene source installation / svn checkout?
ReplyDeleteI tried the following code,
Directory fsDirectory = FSDirectory.open(new File(location));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_47, analyzer);
indexWriterConfig.setRAMBufferSizeMB(16);
indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
indexWriter = new IndexWriter(fsDirectory, indexWriterConfig);
trackingIndexWriter = new TrackingIndexWriter(indexWriter);
referenceManager = new SearcherManager(indexWriter, true, null);
controlledRealTimeReopenThread = new ControlledRealTimeReopenThread(trackingIndexWriter,
referenceManager, 60, 0.1);
controlledRealTimeReopenThread.setDaemon(true);
controlledRealTimeReopenThread.start();
While trying to call this during application initialization, am getting the below exception.
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@C:\Users\arun.bc\lucene-home\contractrate\write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:89) ~[lucene-core-4.7.2.jar:4.7.2 1586229 - rmuir - 2014-04-10 09:00:35]
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:707) ~[lucene-core-4.7.2.jar:4.7.2 1586229 - rmuir - 2014-04-10 09:00:35]
Please suggest...
The Spring container was initializing the bean twice. I fixed the above issue. Could you please confirm whether the above implementation is correct for NRT using Lucene 4.7.2?
That code looks correct!
Then, for each query, you determine whether it needs the "current" reader or it must wait for a specific indexing generation (because you want to ensure a certain indexing change is visible), when acquiring the searcher.
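Concretely, with the 4.x classes from the snippet above, a query that must see a specific change might be handled like this (a sketch; the "id" term, document, and query are hypothetical):

```java
// Track the generation of the change this user must later see:
long gen = trackingIndexWriter.updateDocument(new Term("id", "42"), doc);

// Later, only for this user's query, wait until that generation is
// visible; the reopen thread will hurry up its next refresh.
controlledRealTimeReopenThread.waitForGeneration(gen);

IndexSearcher searcher = referenceManager.acquire();
try {
  TopDocs hits = searcher.search(query, 10);
  // ... render results, guaranteed to reflect the update above ...
} finally {
  referenceManager.release(searcher);
}
```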
Hi Mike,
I'm implementing NRTManager in C# using Lucene.Net.Contrib.Management.dll. I load all documents using an IndexWriter:
<<
Directory d = new RAMDirectory();
indexWriter = new IndexWriter(d, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_CURRENT), !IndexReader.IndexExists(d), new Lucene.Net.Index.IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));
doc = new Document();
doc.Add(new Field(
"info",
info,
Field.Store.YES,
Field.Index.ANALYZED));
// Write the Document to the catalog
indexWriter.AddDocument(doc);
>>
and then initialize the NRTManager with it.
<<
static NrtManager man = new NrtManager(indexWriter);
>>
When I need to add a new entry to the manager I do this:
<<
doc = new Document();
doc.Add(new Field(
"Info",
newInfo,
//dr["NickName"].ToString(),
Field.Store.YES,
Field.Index.ANALYZED));
// Write the Document to the catalog
man.AddDocument(doc);
>>
At search I ALWAYS do this:
<<
if (man.GetSearcherManager().MaybeReopen())
man.GetSearcherManager().Acquire().Searcher.IndexReader.Reopen();
var hits = man.GetSearcherManager().Acquire().Searcher.Search(query, 50);
>>
My problem is that I'm only able to get one new entry after the initial load. When I add a second entry, the search does not return it.
Can you help me with this, please?
Thanks,
Galder.
You should not need to call Searcher.IndexReader.Reopen like that, assuming the C# port is like Lucene's. A single call to .maybeRefresh will open a new NRT reader, if there are any changes.
Also, NRTManager (renamed / factored out a while back to ControlledRealTimeReopenThread in Lucene) is only needed when you have some threads that want a "real-time" reader and other threads that are OK with the current near-real-time reader.
Maybe you should simplify your test to just use an "ordinary" SearcherManager and see if the problem still happens?
If so, there must be a bug somewhere in tracking of changes in the C# IndexWriter...
Hello Mike,
Need your help to address the below error while refreshing the Lucene index:
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
We have a batch process which, on a daily basis, creates the index and replaces the old indexes, and we have an API which consumes the indexes in the meantime.
We are getting an error from the API while this refresh happens.
Can you help us to know what the best practice is to refresh the Lucene indexes without affecting any existing components that are using them?
Your suggestions are highly appreciated.
Regards,
Pavan
Hi Pavan,
You should use SearcherManager -- it makes it really simple to refresh the searcher while queries are still in flight across multiple threads.
It's best to ask on the Lucene user's list -- java-user@lucene.apache.org
Thanks a lot, Mike. I am using SearcherManager to refresh the documents and getting the error. I will reach out to the Lucene user's list.
Regards,
Pavan