- Atomicity: when you make changes (adding, removing documents) in an IndexWriter session and then commit, either all of your changes are visible (if the commit succeeds) or none of them (if it fails), never something in-between. Some methods have their own atomic behavior: if you call updateDocument, which is implemented as a delete followed by an add, you'll never see the delete without the add, even if you open a near-real-time (NRT) reader or commit from a separate thread. Similarly, if you add a block of documents using the relatively new addDocuments method, any reader you obtain will see either none or all of the documents.
- Consistency: if the computer or OS crashes, the JVM crashes or is killed, or power is lost, your index will remain intact (ie, not corrupt). Note that other problems, such as bad RAM, a bit-flipping CPU, or file-system corruption, can still easily corrupt an index!
- Isolation: while IndexWriter is making changes, nothing is visible to any IndexReader searching the index until you commit or open a new NRT reader. Only one IndexWriter instance at a time can change the index.
- Durability: once commit returns, all changes have been written to durable storage (assuming your I/O system correctly implements fsync). If the computer or OS crashes, the JVM crashes or is killed, or power is lost, all changes will still be present in the index.
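A minimal sketch of these semantics, assuming Lucene 9.x on the classpath (ByteBuffersDirectory is just an in-memory Directory used here for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

Directory dir = new ByteBuffersDirectory();
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

Document first = new Document();
first.add(new StringField("id", "1", Field.Store.YES));
writer.addDocument(first);
writer.commit();                          // durable: doc "1" is now on stable storage

Document second = new Document();
second.add(new StringField("id", "2", Field.Store.YES));
writer.addDocument(second);               // buffered, not yet committed

// Isolation: a reader opened now sees only the last commit (one doc, not two).
DirectoryReader reader = DirectoryReader.open(dir);
int visibleBeforeRollback = reader.numDocs();   // 1
reader.close();

writer.rollback();                        // atomically discards doc "2"; also closes the writer

DirectoryReader after = DirectoryReader.open(dir);
int visibleAfterRollback = after.numDocs();     // still 1
after.close();
```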
Lucene even exposes a two-phased commit API: call the prepareCommit method to do all of the hard work (applying buffered deletes, writing buffered documents, fsyncing files). If something is going to go wrong (e.g., disk fills up), it'll almost certainly happen during this first phase. Then, call commit to complete the transaction.
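Driving the two phases explicitly looks like this; a sketch assuming Lucene 9.x, with illustrative index setup:

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

Directory dir = new ByteBuffersDirectory();
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
Document doc = new Document();
doc.add(new StringField("id", "1", Field.Store.YES));
writer.addDocument(doc);

try {
  writer.prepareCommit();   // phase 1: applies deletes, writes segments, fsyncs; failures surface here
  writer.commit();          // phase 2: atomically publishes the new commit; very unlikely to fail
} catch (IOException e) {
  writer.rollback();        // back out everything since the last successful commit (closes the writer)
  throw e;
}
```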
When you close the IndexWriter, it calls commit under the hood. If, instead, you want to discard all changes since the last commit, call the rollback method, which also closes the writer. You can even rollback a CREATE: if you have an existing index, and you open an IndexWriter on it with OpenMode.CREATE, and then rollback, the index will be unchanged. Likewise, if you call deleteAll and then rollback, the index will be unchanged.
Note that merely opening an IndexWriter on a new directory does not create an empty commit; ie, you cannot open an IndexReader on the directory until you've called commit yourself.
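This is easy to observe; a sketch assuming Lucene 9.x, where DirectoryReader.open throws IndexNotFoundException until the first commit exists:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexNotFoundException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

Directory dir = new ByteBuffersDirectory();
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

boolean readable;
try {
  DirectoryReader.open(dir).close();
  readable = true;
} catch (IndexNotFoundException e) {
  readable = false;          // no segments file yet: the writer has not committed
}

writer.commit();             // creates the first (empty) commit
DirectoryReader reader = DirectoryReader.open(dir);   // now succeeds
int numDocs = reader.numDocs();                       // 0
reader.close();
writer.close();
```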
Lucene does not implement a transaction log itself, but it's easy to build that layer on top; popular search servers such as Solr and ElasticSearch do so.
Multiple commits in one index
A single Lucene index is free to contain more than one commit; this is a powerful yet often overlooked feature. Each commit holds a point-in-time view of the index as it existed when the commit was created.
This is similar to the snapshots and writable clones available in modern filesystems like ZFS and the up-and-coming Btrfs. In fact, Lucene is able to efficiently expose multiple commits for the very same underlying reason: all index segments and files are write-once, just like the file blocks in ZFS and Btrfs.
To save multiple commits in your index, just implement your own IndexDeletionPolicy and pass it to IndexWriter. This is the class Lucene uses to decide which commits should be deleted: IndexWriter invokes it on opening an index and whenever a commit succeeds. The default policy, KeepOnlyLastCommitDeletionPolicy, deletes all but the last commit. If you use NoDeletionPolicy, then every commit is retained!
You can pass a custom userData map to IndexWriter's commit, to record custom information (opaque to Lucene) about that commit, and then use IndexReader.listCommits to find all commits in the index. Once you've found a commit, you can open an IndexReader on it to search the index as of that commit.
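Putting those pieces together; a sketch using Lucene 7+ APIs (setLiveCommitData replaced the older commit-with-userData variant, and listCommits lives on DirectoryReader since Lucene 4):

```java
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoDeletionPolicy;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

Directory dir = new ByteBuffersDirectory();
IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer())
    .setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);   // retain every commit
IndexWriter writer = new IndexWriter(dir, cfg);

Document d1 = new Document();
d1.add(new StringField("id", "1", Field.Store.YES));
writer.addDocument(d1);
writer.setLiveCommitData(Map.of("version", "v1").entrySet());
writer.commit();                                          // commit #1: one doc

Document d2 = new Document();
d2.add(new StringField("id", "2", Field.Store.YES));
writer.addDocument(d2);
writer.setLiveCommitData(Map.of("version", "v2").entrySet());
writer.commit();                                          // commit #2: two docs
writer.close();

List<IndexCommit> commits = DirectoryReader.listCommits(dir);  // sorted oldest first
IndexCommit v1 = commits.get(0);                               // userData: {version=v1}
DirectoryReader asOfV1 = DirectoryReader.open(v1);             // the index as of commit #1
int docsAtV1 = asOfV1.numDocs();                               // 1
asOfV1.close();
```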
You can also open an IndexWriter on a prior commit, to effectively roll back all changes after it: this is just like the rollback method, except it enables you to rollback across commits, not just the changes made in the current IndexWriter session.
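A sketch of rolling back across commits this way (Lucene 9.x assumed; NoDeletionPolicy keeps the newer commit around, so this behaves like a branch):

```java
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoDeletionPolicy;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

Directory dir = new ByteBuffersDirectory();
IndexWriter w1 = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer())
    .setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE));
Document d1 = new Document();
d1.add(new StringField("id", "1", Field.Store.YES));
w1.addDocument(d1);
w1.commit();                                   // commit A: one doc
Document d2 = new Document();
d2.add(new StringField("id", "2", Field.Store.YES));
w1.addDocument(d2);
w1.commit();                                   // commit B: two docs
w1.close();

List<IndexCommit> commits = DirectoryReader.listCommits(dir);
IndexCommit commitA = commits.get(0);          // the oldest commit

IndexWriter w2 = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer())
    .setIndexCommit(commitA)                   // start from commit A, ignoring commit B
    .setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE));
DirectoryReader nrt = DirectoryReader.open(w2);
int docsSeen = nrt.numDocs();                  // 1: commit B's extra doc is not visible
nrt.close();
w2.rollback();                                 // leave both original commits untouched
```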
Old commits are still kept even when you open an index with OpenMode.CREATE. It's also fine to pass OpenMode.CREATE while IndexReaders are still searching the old commits. This enables fun use cases, such as fully re-indexing your content between each commit without affecting any open readers.
Combining all of these fun transactional features, you can do some cool things:
- Hot backups, using SnapshotDeletionPolicy or PersistentSnapshotDeletionPolicy: these deletion policies make it trivial to take a "live" backup of the index without blocking ongoing changes with IndexWriter. The backup can easily be incremental (just copy the new files, remove the deleted ones), and you can freely throttle the IO to minimize any interference with searching.
- Searching different catalog versions: perhaps you run an e-commerce site and ship multiple versions of your catalog. In this case you can keep older commits around, each searching a specific version of your catalog, enabling users to choose which catalog to search.
- Repeatable indexing tests from the same initial index: maybe you want to run a bunch of performance tests, perhaps trying different RAM buffer sizes or merge factors, starting from a large initial index. To do this, simply run each test, but at the end, instead of closing the IndexWriter, use the rollback method to quickly return the index to its initial state, ready for the next test.
- Force all index segments to be merged down to a single segment, but also keep the prior multi-segment commit. Then you can run tests to compare multi-segment vs. single-segment performance.
- Indexing and searching over the NFS file system: because NFS does not protect still-open files from deletion, you must use an IndexDeletionPolicy to keep each commit around until all open readers have finished with it (ie, reopened to a newer commit). The simple approach is time-based; for example: don't delete a commit until it is 15 minutes old, and always reopen your readers every 5 minutes. Without this you'll hit all sorts of scary exceptions when searching over NFS.
- Distributed commit: if you have other resources that must commit atomically along with the changes to your Lucene index, you can use the two-phased commit API. This is simple, but vulnerable to failures during the 2nd phase; to also recover from such cases, for example if Lucene completed its 2nd phase commit but the database's 2nd phase hit an error, crash, or power loss, you can easily rollback Lucene's commit by opening an IndexWriter on the prior commit.
- Experimental index changes: maybe you want to try re-indexing some
subset of your index in a new way, but you're not sure it'll work
out. In this case, just keep the old commit around, and then
rollback if it didn't work out, or delete the old commit if it did.
- Time-based snapshots: maybe you'd like the freedom to roll back to your index as it existed 1 day ago, 1 week ago, 1 month ago, etc., so you preserve commits based on their age.
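For the hot-backup case above, here is a sketch using SnapshotDeletionPolicy (Lucene 4.x+ APIs assumed; the actual copy step is left as a comment since the destination is app-specific):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.index.SnapshotDeletionPolicy;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

Directory dir = new ByteBuffersDirectory();
SnapshotDeletionPolicy sdp =
    new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(new StandardAnalyzer()).setIndexDeletionPolicy(sdp));

Document doc = new Document();
doc.add(new StringField("id", "1", Field.Store.YES));
writer.addDocument(doc);
writer.commit();                                  // a snapshot needs at least one commit

IndexCommit snapshot = sdp.snapshot();            // pin this commit's files
List<String> filesToCopy = new ArrayList<>(snapshot.getFileNames());
try {
  // copy each file in filesToCopy out of dir; an incremental backup skips
  // files already present at the destination (index files are write-once)
} finally {
  sdp.release(snapshot);                          // unpin; files may be deleted once unreferenced
  writer.deleteUnusedFiles();
}
writer.close();
```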
The "different catalog versions" example and the "experimental index changes" ideas sound really practical.
I think you're absolutely right when you call the multiple commits feature "overlooked" (I didn't know this before), because this allows an index to be versioned, correct?
Is it even possible to identify the differences between two commits at the document or even field level (like the diff feature in revision control systems), or am I mistaken?
Exactly, this allows versioning the index; your app gets to decide when to take a point-in-time snapshot to create a new version (ie, by committing).
You could in theory compute a diff between two commits... Lucene doesn't have such an "index differ" (hmm maybe in our test-framework we might have something closish), but the app could build out that diffing on top of Lucene's public APIs. At that point you could diff two separate indices (ie, different directories), or two commits within a single index.
It's fun stuff!
Can you please point out the above-mentioned "closish" test code in the test framework? I've managed to find only the TestIndicesEquals test case, but that is rather an index-equality check. Is this the one you mentioned above? Can you please give me some advice on how to implement such a differ in general? I'm eager to contribute it back to the community if my trials end up as a useful feature.
Thanks in advance for your help!
Maybe oal.index.TestStressIndexing2? It has a series of "verifyEquals" methods.
Nice blog. BTW, I was wondering what would be the reason that someone would want to implement a transactional log for Lucene, if Lucene already supports all these transactional semantics?
A transactional log would mean that if the app/OS/computer crashed, on startup the log could be replayed to catch the index up to whatever the app had indexed.
Without a transactional log, if the app/OS/computer crashes, the index will fall back to the last successful commit, so you've lost any changes you'd made after that commit.
Awesome use cases explained, Mike. I have a query: what happens if I perform a commit() on an IndexWriter which is opened on a prior commit in Lucene?
Does it create some kind of a branch internally for the new commit performed on the prior commit, or does it replace all the newer commits after that prior commit, retaining only the recent commit that I just performed on it?
I am very curious to know its answer as it can be useful to me in some future use case.
I have also posted the same question on stackoverflow http://goo.gl/rczlt so you may answer it there also if you like!
In fact it creates a branch: the new commit will reflect the old one you had opened, plus any changes you made during the indexing session. What happens to the future commits (after the one you had opened) is up to your deletion policy. If it saves those commits, then at any time you are free to open a writer against one of them, making a branch from them as well.
(I also answered on StackOverflow).
Thanks, Mike!
Great article, thank you!
I'm not sure if I understood "Isolation" right in Lucene.
Am I right that if multiple threads share an IndexWriter concurrently, it is possible that one thread commits or rolls back the changes another thread has made? If so, Lucene doesn't implement "Isolation" in the sense that multiple threads sharing one IndexWriter would not see each other's changes until they commit.
So what's the best approach for gaining "real" transaction isolation if multiple threads need to update an index concurrently (e.g. in a web application)?
I've currently identified two approaches:
- Synchronize on a single IndexWriter instance whenever needed.
- Open a new IndexWriter at the beginning of a transaction - Lucene will prevent multiple IndexWriters being opened at the same time on a single index.
That's correct: there is no Isolation between multiple writer threads, only between writers and readers. Every writer thread sees all changes made by the other writer threads.
If you need Isolation between writers, those two approaches will work. You can have each writer write to its own private index directory, and in the end (if necessary) use IndexWriter.addIndexes to copy over that writer's private index into the main index.
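A sketch of that private-index approach (Lucene 9.x assumed; directory names are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Each writer thread indexes into its own private directory, fully isolated.
Directory privateDir = new ByteBuffersDirectory();
IndexWriter privateWriter =
    new IndexWriter(privateDir, new IndexWriterConfig(new StandardAnalyzer()));
Document doc = new Document();
doc.add(new StringField("id", "1", Field.Store.YES));
privateWriter.addDocument(doc);
privateWriter.close();                 // commits the private index

// At "transaction end", fold the private index into the main one.
Directory mainDir = new ByteBuffersDirectory();
IndexWriter mainWriter =
    new IndexWriter(mainDir, new IndexWriterConfig(new StandardAnalyzer()));
mainWriter.addIndexes(privateDir);     // copies the private segments into the main index
mainWriter.commit();
mainWriter.close();

DirectoryReader main = DirectoryReader.open(mainDir);
int mainDocs = main.numDocs();         // 1
main.close();
```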
Thanks! I will think about that... I'll also take a look at Neo4j - it seems they have added some transactional semantics to Lucene.
Great post, Mike. Many thanks for it!
I was planning to play with index versioning by setting commit data, and I also wanted to try to implement some kind of branching feature with a customized index deletion policy. The idea was to store only the most recent (HEAD) revision of the index per branch and delete everything else after a commit. Unfortunately I'm a bit stuck, as I also wanted to add concurrent read/write access to the index, and I cannot open multiple index writers on prior IndexCommits to represent different branch HEADs in one index directory.
What do you think about this approach in general? Does it even make sense? If yes, is it doable and I'm just missing something, if not can you please give me some pointers?
Any kind of feedback is much appreciated.
I think you can index multiple branches into a single index; your IDP would have to track which commit points correspond to the head of which branch. But since you can only have one IW open at once, you'd have to open an IW on an old commit point, keep the other commit points that are heads of other branches, make all changes for this one branch, and close (deleting the previous head commit for the current branch).
For searching you can of course open multiple readers, one on each head commit point. I would open one reader first, then use openIfChanged from that one to get to the other head commit points.
first off, thank you for the quick reply.
I've implemented the IDP exactly as you suggested, and I opened the readers on the HEAD commit points as well. For this I extended ReferenceManager, as I was unable to find another way to open a reader on a particular IndexCommit via ReaderManager. I think the only difference is the way I tried to use the IW. I'll check my test code ASAP and respond here if I find the solution.
Thanks again for the help.
Works like a charm. Great feature.
Am I correct in assuming that concurrent commits performed on different 'branches' are not possible in my case, since only one IW can be opened on the index?
If concurrent writing to an index is not possible, and I would like to avoid IW synchronization, I guess I have to open a new private 'slave' index for each 'branch' writer. After committing changes I have to open a writer on the 'master' directory and merge in the changes from the 'slave'. Is there a way to merge deletions as well?
Thanks in advance for your reply.
Also, that's a rather annoying limitation that you cannot open a specific IndexCommit with ReaderManager. Can you open a Jira issue? Maybe we can fix that ... thanks!
I've opened the issue as you requested.
Thanks for opening that issue; I just added a comment!
Right, only one IW at a time on a given index.
You can use addIndexes to add in all documents from an external index, but this does not "replace" the updated documents. You'd have to separately keep track of the deletions, and then addIndexes. But indexing is quite fast ... are you sure you need to optimize this?
First off, apologies for my late reply.
You're right, indexing is fast enough in general. But in my case it may happen that the changes made by one user require only a couple of document additions, deletions, and updates (which takes only a couple of milliseconds), while changes performed by another user (on a different branch but at the same time) may require hundreds of thousands of document deletions and updates (which takes seconds). Since there can be only one shared IW, it may happen that the first user has to wait seconds although only a few modifications were made.
Anyway, thanks for your hints and tips and for you great work what you're doing for the community. Much appreciated.
Ahh, I see. So in that case the first user gets "denial of service" due to the second user hogging IndexWriter. I suppose you could do custom scheduling, i.e. stop the second user's indexing, close the IW, open a new IW (on the first user's latest commit point), index the first user's docs, close the IW, then open an IW on the second user's commit point and "resume" indexing the second user's docs. Tricky :)
That's a brilliant idea. Thank you!
Taking all this into account, am I right in saying that after performing a commit, doing a "verify success" check (using an IndexReader to open the new index and confirming numDocs matches what you expect) is futile?
What do you mean by "is futile"?
Once the commit succeeds, all the changes are on disk. But if you want to go and further verify that certain documents are present, that can still be worthwhile, e.g. it could catch bugs in your application where certain docs were not indexed before the commit.
Sorry, I meant that the check provides no benefit (although it seems it may). Would you leave this type of check in production code, though, or would it prove expensive for a large index? Currently there is no check for specific documents, just a sanity check on the total doc count.
I don't think there's any need for such a check in production code...
Very informative post!
I have a situation where I have 2 commit points, one corresponding to saved application data and one corresponding to unsaved data. When I close the application, I want to delete (roll back) the data corresponding to the unsaved commit, so that the unsaved data is no longer in the index.
Is there a way to rollback or delete this data other than reopening the index writer with the Saved commit?
Yes: use IndexWriterConfig.setIndexCommit and then open the IndexWriter. This will roll back to that commit point, removing any later commit points.
Ok, thanks for the answer. So this means that opening an index writer is the only way to roll back to a commit point? Opening an index writer only for the rollback is not ideally my choice, as I have to do this in the application shutdown procedure: I would have to close the already-opened writer, reopen it to roll back to a commit, and then close it immediately. Would any other option work in my case, like deleting a commit? (This is for Lucene.Net.)
Sorry, that's the only way with Lucene's APIs. Even when you call IW.rollback with an already opened IW (to roll back all changes since the last commit), that also closes the IW.
Also, I'm answering wrt Java Lucene ... I'm not sure whether Lucene.Net has additional APIs here, but likely not (it tries to be a pure port, I think).
I have this situation: every minute a job (EJB timer) adds from 0 to at most 1000 documents to the index. I start the transaction like this:
org.infinispan.Cache cache = (org.infinispan.Cache) getCacheGetter().getTreeCache("Index");
TransactionManager transactionManager = cache.getAdvancedCache().getTransactionManager();
I open the index like this:
Directory index = DirectoryBuilder.newDirectoryInstance(cache, cache, cache, cacheName).create();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer); // analyzer is a StandardAnalyzer
IndexWriter indexWriter = new IndexWriter(index, config);
I add documents in a loop:
doc.add(new StringField("name", value, Field.Store.YES));
indexWriter.addDocument(doc); // or updateDocument if the item exists
When the system is under stress I get this exception:
(Lucene Merge Thread #0) Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: Error loading metadata for index file: _9.cfs|M|Event
Where am I going wrong?
I use Infinispan 7 and Lucene 4.10.4.
12:44:24,404 ERROR [stderr] (Lucene Merge Thread #0) Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: Error loading metadata for index file: _2t.cfs|M|Event
12:44:24,405 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:549)
12:44:24,405 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:522)
12:44:24,405 ERROR [stderr] (Lucene Merge Thread #0) Caused by: java.io.FileNotFoundException: Error loading metadata for index file: _2t.cfs|M|Event
12:44:24,405 ERROR [stderr] (Lucene Merge Thread #0) at org.infinispan.lucene.impl.DirectoryImplementor.openInput(DirectoryImplementor.java:134)
12:44:24,405 ERROR [stderr] (Lucene Merge Thread #0) at org.infinispan.lucene.impl.DirectoryLuceneV4.openInput(DirectoryLuceneV4.java:101)
12:44:24,405 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.store.CompoundFileDirectory.(CompoundFileDirectory.java:104)
12:44:24,405 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:274)
12:44:24,405 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.index.SegmentReader.(SegmentReader.java:107)
12:44:24,405 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
12:44:24,406 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.index.ReadersAndUpdates.getReaderForMerge(ReadersAndUpdates.java:664)
12:44:24,406 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4152)
12:44:24,406 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3811)
12:44:24,406 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:409)
12:44:24,406 ERROR [stderr] (Lucene Merge Thread #0) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:486)
I have a situation where my data is fully indexed. It is possible to make changes to this data, and these delta changes get indexed too, creating new segment files.
But I need to prevent merging of the index data corresponding to the delta changes with the original (big) index file.
I was thinking about increasing the merge factor to a high value so that automatic merging is effectively disabled (and then deciding on a custom merge point), but I read that this risks running out of file descriptors.
How can I segregate delta index changes efficiently?
Also, it would be better if I could merge only the segments created by the delta changes, without touching the original index file. Is there an option to do this?