Comments on Changing Bits: Lucene's near-real-time search is fast!

Hi Xatrix, Could you send an email to the Lucene ...

2013-01-14T08:47:03.016-05:00

Hi Xatrix,

Could you send an email to the Lucene users list instead (java-user@lucene.apache.org)? Thanks.

Hello Mike, I have got a problem and I would like...

2013-01-14T06:34:02.129-05:00

Hello Mike,

I have got a problem and I would like to ask you for advice. Every 10 minutes I am updating approx. 6k documents with Lucene 3.6; there might be some new documents added, but most of the time we are just updating and need to see only the latest version. I am using NRT and commit the index writer after every 1000 docs. After approx. 100 iterations I am seeing a problem with constant merging of files (it seems that it is constantly merging files with one document?). Can you please advise what policy I should use?

Also, I have noticed messages like these in the log:
DW: ramUsed=889.721 MB newFlushedSize=0.1 MB docs/MB=9.982 new/old=0.011%

IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: don't apply deletes now delTermCount=42 bytesUsed=43008
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: clearFlushPending
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: findMerges: 15 segments
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=*_1d9wh(3.5):C4916/288 size=52.694 MB
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da6s(3.5):C159 size=2.079 MB
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da5p(3.5):C153 size=2.025 MB [merging]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da73(3.5):C10 size=0.223 MB [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7e(3.5):C1 size=0.100 MB [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da78(3.5):C1 size=0.100 MB [merging] [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da76(3.5):C1 size=0.100 MB [merging] [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7g(3.5):C1 size=0.100 MB [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da77(3.5):C1 size=0.100 MB [merging] [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7b(3.5):C1 size=0.100 MB [merging] [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da74(3.5):C1 size=0.100 MB [merging] [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7f(3.5):C1 size=0.071 MB [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7a(3.5):C1 size=0.068 MB [merging] [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da7c(3.5):C1 size=0.068 MB [merging] [floored]
IW 0 [Mon Jan 14 10:50:06 UTC 2013; PrepDoc-12]: TMP: seg=_1da79(3.5):C1 size=0.068 MB [merging] [floored]

Thanks for any response

Anyone have an example of IndexWriter.SetMergedSeg...

2012-09-04T17:27:16.182-04:00

Anyone have an example of IndexWriter.SetMergedSegmentWarmer usage?

It is indeed tricky, but I think that boolean to g...

2012-02-05T08:09:04.656-05:00

It is indeed tricky, but I think that boolean to getSearcherManager (in addition to maybeReopen) is still useful.

Imagine an app where sometimes you need deletes applied (users doing searches) and other times you don't (say, testing whether a certain document is in the index, such that you don't care if the old deleted "ghosts" are also returned).

For that 2nd case you should pass "false" for applyDeletes to waitForGeneration, because this tells the reopen thread that the caller who is waiting doesn't need deletes applied, which in turn makes the reopen faster.

okay..... (and BTW I WAS using the reopen thread, ...

2012-02-03T11:00:15.607-05:00

okay..... (and BTW I WAS using the reopen thread, that's why Thread.sleep(45000) showed the changes, even with maybeReopen commented out)....then why even have ANY boolean argument applyDeletes except for maybeReopen? Am I missing something here? Because it seems like you are saying "If you absolutely must have a searcher that reflects the deletes, either the NRTThread must have run (and you can use waitForGeneration), or apply maybeReopen(true). And then both getSearcherManager(true) and getSearcherManager(false) will BOTH reflect the deletes". Do you follow me that this is confusing, or am I being obtuse?

Hi MJB, The boolean you pass to getSearcherManage...

2012-02-02T11:20:47.246-05:00

Hi MJB,

The boolean you pass to getSearcherManager does not force a reopen. So if you pass true, all it does is get you (immediately) the latest searcher that was reopened with applyAllDeletes=true. Ie, that returned searcher is only guaranteed to have applied all deletes as of when the last maybeReopen(true) was called.

So you'll still need to use the .waitForGeneration APIs if you need to know a certain operation (the delete op in your example) is visible...

It's also best to use something like the NRTManagerReopenThread: this class watches for any threads waiting on a specific generation, and reopens more aggressively if there are any.

Mike
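The point about getSearcherManager not forcing a reopen can be illustrated with a toy model. This is only a JDK-only sketch, not NRTManager's code: the class ToySearcherSlots, its string document ids, and the assumption that each slot is refreshed only by a maybeReopen call with the matching flag are all hypothetical simplifications.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Toy model with hypothetical names; NOT Lucene's NRTManager.
final class ToySearcherSlots {
    private final Set<String> indexedDocs = new HashSet<>();     // docs accepted by the writer
    private final Set<String> bufferedDeletes = new HashSet<>(); // deletes not yet resolved
    private Set<String> withDeletes = Collections.emptySet();    // last snapshot with deletes applied
    private Set<String> withoutDeletes = Collections.emptySet(); // last snapshot without that guarantee

    synchronized void addDocument(String id) {
        indexedDocs.add(id);
    }

    synchronized void deleteDocument(String id) {
        bufferedDeletes.add(id); // buffered only; no snapshot changes yet
    }

    // Only this method publishes a new snapshot (assumption of this sketch).
    synchronized void maybeReopen(boolean applyAllDeletes) {
        if (applyAllDeletes) {
            indexedDocs.removeAll(bufferedDeletes);
            bufferedDeletes.clear();
            withDeletes = Collections.unmodifiableSet(new HashSet<>(indexedDocs));
        } else {
            withoutDeletes = Collections.unmodifiableSet(new HashSet<>(indexedDocs));
        }
    }

    // Returns the last published snapshot immediately; never reopens.
    synchronized Set<String> getSearcher(boolean applyAllDeletes) {
        return applyAllDeletes ? withDeletes : withoutDeletes;
    }
}
```

In this model, calling deleteDocument and then getSearcher(true) still returns the old snapshot; the delete becomes visible in that slot only after maybeReopen(true) runs, whether the caller or a background thread invokes it.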

Code got mangled: Here it is java.io.File path...

2012-02-01T14:34:51.482-05:00

Code got mangled:

Here it is

java.io.File path=new File("c:\\index");
LuceneBuildIndex.createNewIndex(path, 100000);
IndexWriter iwriter=null;
IndexSearcher one=null; IndexSearcher two=null;
IndexWriterConfig iwc =new IndexWriterConfig(LuceneBasicSearch2.version,new StandardAnalyzer(LuceneBasicSearch2.version));
Directory directory = new SimpleFSDirectory(path);
try {
    iwriter=new IndexWriter(directory,iwc);
    NRTSingleton.getInstance().setNRTManager(iwriter);
    one =NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).acquire();
    ResultList r=LuceneBasicSearch2.searchIndex("apple",one,0, 10000);
    LuceneBasicSearch2.output("Search before deletes", r,true); // 10000
    NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).release(one);
    one=null;
    System.out.println("Delete item");
    NRTSingleton.getInstance().getNRTManager().deleteDocuments(new Term("subject", "apple"));
    one = NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).acquire();
    System.out.println(NRTSingleton.getInstance().getNRTManager().maybeReopen(true)); // This is critical
    two = NRTSingleton.getInstance().getNRTManager().getSearcherManager(true).acquire();
    r=LuceneBasicSearch2.searchIndex("apple",one,0, 10000); // 10000
    LuceneBasicSearch2.output("Search after deletes - one", r,true);

    r=LuceneBasicSearch2.searchIndex("apple",two,0, 10000); // 0
    LuceneBasicSearch2.output("Search after deletes - two", r,true);

    NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).release(one);
    one=null;
    //NRTSingleton.getInstance().getNRTManager().maybeReopen(true);
    System.out.println("Sleep");
    Thread.sleep(45000L);
    System.out.println("After sleep");
    one = NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).acquire();
    r=LuceneBasicSearch2.searchIndex("apple",one,0, 10000);
    LuceneBasicSearch2.output("Search after deletes and sleep - new one", r,true); // 0 regardless of maybeReopen here since background thread goes every 30 seconds
    NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).release(one);
    one=null;
} finally {
    if (one !=null) NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).release(one);
    if (two !=null) NRTSingleton.getInstance().getNRTManager().getSearcherManager(true).release(two);
    if (iwriter !=null) iwriter.close();
    NRTSingleton.getInstance().close();
}

Mike - this might be clearer in the docs. Imagine...

2012-02-01T14:33:14.909-05:00

Mike - this might be clearer in the docs.

Imagine you have this code

// Previously built the index of 100000 items of which 10000 have subject:apple
// Previously run a search showing 10000 results
System.out.println("Delete item");
NRTSingleton.getInstance().getNRTManager().deleteDocuments(new Term("subject", "apple"));
one = NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).acquire();
System.out.println(NRTSingleton.getInstance().getNRTManager().maybeReopen(true)); // This is critical, and the "two" will fail otherwise to see the deletion until 45 seconds
two = NRTSingleton.getInstance().getNRTManager().getSearcherManager(true).acquire();
r=LuceneBasicSearch2.searchIndex("apple",one,0, 10000); // 10000
LuceneBasicSearch2.output("Search after deletes - one", r,true);

r=LuceneBasicSearch2.searchIndex("apple",two,0, 10000); // 0
LuceneBasicSearch2.output("Search after deletes - two", r,true);

NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).release(one);

Here's my point. Everything mostly worked as expected. A searcher with (false) passed didn't see the deletes until after the sleep (the NRT background thread is set to about 30 s). But it surprised and confused me that the searcher (two) ALSO didn't see the deletes until after the sleep, unless you force the issue with maybeReopen(true). I guess I would have thought that getSearcherManager(true) would have automagically forced a separate IndexSearcher in this case?

Do you follow me?
one=null;
//NRTSingleton.getInstance().getNRTManager().maybeReopen(true);
System.out.println("Sleep");
Thread.sleep(45000L);
System.out.println("After sleep");
one = NRTSingleton.getInstance().getNRTManager().getSearcherManager(false).acquire();
r=LuceneBasicSearch2.searchIndex("apple",one,0, 10000);
LuceneBasicSearch2.output("Search after deletes and sleep - new one", r,true); // 0 regardless of maybeReopen here since background thread goes every 30 seconds

Hi Mike, I'm just trying to find out how good Lu...

2011-11-16T06:14:39.318-05:00

Hi Mike,

I'm just trying to find out how good Lucene is in comparison to other search tools.

I recently looked at www.deudil.com and was impressed by the autopopulation and search experience but wasn't sure whether there is a way to find out what they used.

I'm really keen to select the best solution to deliver the search experience for our website hence would appreciate your guidance. I read a lot about Funnelback as well but not sure about their solution either.

Thanks

Not a dumb question; it's a great one! I'...

2011-11-01T08:59:05.581-04:00

Not a dumb question; it's a great one!

I'm planning to do a blog post explaining the difference... but the gist is this: use NRTManager if you sometimes care to control which indexing changes are visible to which search threads. Otherwise, use SearcherManager.

For example say you use Lucene to search discussion threads (on a forum). When user X adds a new comment on a discussion thread, you then go and add that comment into the Lucene index. But then when user X searches, rather than just pulling the "current" NRT reader, which may not yet reflect this user's added comment, you should pull a reader that's guaranteed to reflect the change from the comment the user just added. For this (tying certain searches to certain changes in the index), you need to use NRTManager.
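The forum scenario boils down to tracking which index change a given search must be able to see. Below is a minimal JDK-only sketch of that idea, assuming a simple generation counter; ToyGenerationTracker and its methods (recordChange, reopen, and waitForGeneration with a timeout in milliseconds) are hypothetical names for illustration and are not NRTManager's actual API.

```java
import java.util.concurrent.TimeUnit;

// Toy model of generation tracking; hypothetical names, NOT Lucene's NRTManager.
final class ToyGenerationTracker {
    private long indexingGen = 0;  // bumped for every index change
    private long searchingGen = 0; // newest generation visible to searchers

    // Called by the indexing side; returns the generation of this change.
    synchronized long recordChange() {
        return ++indexingGen;
    }

    // Called by a reopen thread; publishes everything indexed so far.
    synchronized void reopen() {
        searchingGen = indexingGen;
        notifyAll(); // wake threads blocked in waitForGeneration
    }

    // Blocks until targetGen is searchable or the timeout elapses.
    synchronized boolean waitForGeneration(long targetGen, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (searchingGen < targetGen) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false; // the change is still not visible
                }
                TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        }
    }
}
```

In the forum example, the thread that indexes user X's comment would keep the value returned by recordChange and pass it to waitForGeneration before running user X's search; searches by other users would skip the wait and use whatever is currently visible.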

Thanks Mike - I guess this is a dumb question but ...

2011-10-31T17:36:31.662-04:00

Thanks Mike - I guess this is a dumb question but why use SearcherManager vs NRTManager? Well I guess if the node is read only that's one reason, but is there another?

Hi mjb, You can of course ask questions without b...

2011-10-31T08:11:23.874-04:00

Hi mjb,

You can of course ask questions without buying the book! ;)

That use case you described is exactly what NRTManager is good at! Note also that the next Lucene release (3.5.0) will have improvements to SearcherManager to also pull NRT readers from an IndexWriter.

NRTManager (and SearcherManager) do in fact close the readers, but it's tricky: rather than call .close(), they call .decRef(); under the hood, .close() in fact calls .decRef, but guarded to ensure only the first .close() decRefs. So the readers are getting closed (and we have good stress tests in Lucene that should fail if they were not getting closed...).
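The close()/decRef() relationship described above can be sketched with a small reference-counted class. This is a JDK-only illustration under assumptions: RefCountedReader and its fields are hypothetical names, not Lucene's IndexReader, and the real implementation differs in detail.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Toy reference-counted resource; hypothetical, NOT Lucene's IndexReader.
final class RefCountedReader {
    private final AtomicInteger refCount = new AtomicInteger(1); // the opener holds one reference
    private final AtomicBoolean closeCalled = new AtomicBoolean(false);
    private volatile boolean released = false;

    // Take an extra reference, e.g. when a search thread acquires the reader.
    void incRef() {
        while (true) {
            int current = refCount.get();
            if (current <= 0) {
                throw new IllegalStateException("reader is already released");
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return;
            }
        }
    }

    // Drop one reference; the last one out releases the underlying resources.
    void decRef() {
        int remaining = refCount.decrementAndGet();
        if (remaining == 0) {
            released = true; // stand-in for closing files, freeing RAM, etc.
        } else if (remaining < 0) {
            throw new IllegalStateException("decRef called too many times");
        }
    }

    // close() is just a guarded decRef(): only the first call drops a reference.
    void close() {
        if (closeCalled.compareAndSet(false, true)) {
            decRef();
        }
    }

    int getRefCount() {
        return refCount.get();
    }

    boolean isReleased() {
        return released;
    }
}
```

With this shape, a manager that hands out readers can call decRef() when a searcher is released, and the resources go away as soon as the count reaches zero, even though nobody ever calls close() on that instance.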

Mike - I've looked at your code and bought LIA...

2011-10-30T20:24:26.015-04:00

Mike - I've looked at your code and bought LIA. So I hope I earned a question or two :)

The NRTManager - does it work well in the scenario where you open one IndexWriter, use that mostly for indexing in one thread, but occasionally for deleting documents in another, and use the IndexSearcher provided here for all else? The answer appears to be yes, and indeed that seems to be its primary purpose, but the code confuses me in one respect:

The "real questions" - NRTManager never closes any of the old IndexReader or IndexSearcher instances. Isn't this a problem? And didn't we always get told best practice was to reuse the IndexSearcher :)

Hi Mike, Great post. Please let me know, is there...

2011-08-01T15:34:18.726-04:00

Hi Mike,

Great post. Please let me know, is there an e-mail address where I can contact you in private?

Hi Pascal, A commit must also fsync() the new fil...

2011-06-23T16:34:19.069-04:00

Hi Pascal,

A commit must also fsync() the new files, so that the bits are all guaranteed to be on stable storage, which can be extremely costly (seconds).

Also, the commit must write a new segments_N file, and write deleted docs to disk, which NRT avoids (because it gets these changes directly in RAM).
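To make the fsync() cost concrete, here is a small JDK-only sketch that writes a file with or without forcing it to stable storage. It is not how Lucene implements commit; DurableWriteSketch, write and roundTrip are hypothetical names, and the only real API relied on is FileChannel.force(true), which asks the operating system to flush the file's content and metadata to the storage device.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper; illustrates the fsync step only, NOT Lucene's commit.
final class DurableWriteSketch {

    // Writes content to file; when fsync is true, blocks until the bytes are on stable storage.
    static void write(Path file, String content, boolean fsync) {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buffer = ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            if (fsync) {
                channel.force(true); // the potentially slow part a commit has to pay for
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes to a temporary file, reads it back, and cleans up.
    static String roundTrip(String content, boolean fsync) {
        try {
            Path file = Files.createTempFile("nrt-sketch", ".bin");
            try {
                write(file, content, fsync);
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both variants return the same bytes when read back by the same process; the difference is only whether the data is guaranteed to survive a crash or power loss, which is the guarantee an NRT reopen skips.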

Hi, Just a question. You mention that "when ...

2011-06-23T16:24:06.389-04:00

Hi,

Just a question. You mention that "when an NRT reader is opened, Lucene flushes indexed documents as a new segment". So how exactly is this different than a commit?

Thanks for that nice article!

Hi Yeroc, I haven't looked closely a Zoie for...

2011-06-10T15:44:02.394-04:00

Hi Yeroc,

I haven't looked closely at Zoie for some time... so the following
could very well be "dated":

The biggest difference is that Zoie aims for immediate consistency
(reopen after every index change & next query), which I think very few
apps really require, given how fast NRT is.

Also, NRTCachingDir (caching small segments in RAM) achieves the
biggest (in my opinion) benefit of Zoie, but with substantially less
added complexity. Reducing complexity is important because it means
less risk of bugs; for example, Zoie had some scary corruption bugs,
which took quite some time to track down; see
https://issues.apache.org/jira/browse/LUCENE-2729

The other part of Zoie I remember is deferring resolving deletions to
Lucene docIDs, and instead using a bloom filter to post-filter
collected documents. While I understand the motivation for this
("immediate consistency") I think it's the wrong tradeoff since it
necessarily slows down all searching (checking a bloom filter is more
costly than Lucene's checking a bit set), not to mention the added RAM
required for the bloom filter.

Ie, it's better to spend more time during reopen to resolve the
deletions, so that searches don't slow down.
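The tradeoff can be sketched as follows: do the expensive key-to-docID resolution once at reopen, so that each hit during search costs a single bit lookup. This is a JDK-only toy under assumptions; ToyDeleteResolver, its method names and the string keys are hypothetical, and it does not reproduce Lucene's or Zoie's actual data structures.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Set;

// Toy illustration with hypothetical names; NOT Lucene's or Zoie's implementation.
final class ToyDeleteResolver {

    // Reopen-time work: map deleted keys to docIDs once (docID = position in docKeys).
    static BitSet resolveDeletes(List<String> docKeys, Set<String> deletedKeys) {
        BitSet deleted = new BitSet(docKeys.size());
        for (int docId = 0; docId < docKeys.size(); docId++) {
            if (deletedKeys.contains(docKeys.get(docId))) {
                deleted.set(docId);
            }
        }
        return deleted;
    }

    // Search-time work: one bit check per matching document.
    static List<Integer> collectLive(int[] matchingDocIds, BitSet deleted) {
        List<Integer> live = new ArrayList<>();
        for (int docId : matchingDocIds) {
            if (!deleted.get(docId)) {
                live.add(docId);
            }
        }
        return live;
    }
}
```

The cost of resolveDeletes grows with the size of the segment and is paid once per reopen, while collectLive runs for every query, which is why shifting work out of the per-hit path keeps searches fast.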

I'm sure there are other differences... and I imagine Zoie has changed
a lot since I last looked!

I'm curious how this compares to using Zoie (h...

2011-06-09T15:29:43.347-04:00

I'm curious how this compares to using Zoie (http://sna-projects.com/zoie/) from LinkedIn on top of Lucene? I know they did an awful lot of tuning. Not sure why their work hasn't made it into the core of Lucene.

Ahh, good point mindas; I just fixed it. Thanks!

2011-06-09T06:40:13.415-04:00

Ahh, good point mindas; I just fixed it. Thanks!

It's worth mentioning that IndexReader.open(In...

2011-06-09T06:30:10.732-04:00

It's worth mentioning that IndexReader.open(IndexWriter) is only available since Lucene v3.1; earlier versions don't have this method (the article assumes that NRT was available since 2.9, which can be misleading). Otherwise a very interesting read, thanks Mike!