- Atomicity: when you make changes (adding or removing documents) in an IndexWriter session, and then commit, either all (if the commit succeeds) or none (if the commit fails) of your changes will be visible, never something in-between. Some methods have their own atomic behavior: if you call updateDocument, which is implemented as a delete followed by an add, you'll never see the delete without the add, even if you open a near-real-time (NRT) reader or commit from a separate thread. Similarly, if you add a block of documents using the relatively new addDocuments method, you'll see either none or all of the documents in any reader you obtain.
- Consistency: if the computer or OS crashes, or the JVM crashes or is killed, or power is lost, your index will remain intact (i.e., not corrupt). Note that other problems, such as bad RAM, a bit-flipping CPU or file system corruption, can still easily corrupt the index.
- Isolation: while IndexWriter is making changes, nothing is visible to any IndexReader searching the index, until you commit or open a new NRT reader. Only one IndexWriter instance at a time can change the index.
- Durability: once commit returns, all changes have been written to durable storage (assuming your I/O system correctly implements fsync). If the computer or OS crashes, or the JVM crashes or is killed, or power is lost to the computer, all changes will still be present in the index.
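The visibility rules in the list above can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions: the class CommitVisibilitySketch and its methods are hypothetical stand-ins rather than Lucene classes, it is single-threaded, and it models neither durability nor NRT readers. It shows only that buffered changes stay invisible until commit, that commit publishes them all at once, and that a reader keeps the point-in-time view it was opened on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy in-memory model of commit visibility; not a Lucene class. */
final class CommitVisibilitySketch {
    // The last committed, immutable point-in-time view.
    private volatile List<String> committed = Collections.emptyList();
    // Changes buffered since the last commit; invisible to readers.
    private final List<String> pending = new ArrayList<>();

    void addDocument(String doc) {
        pending.add(doc);
    }

    /** Publishes all pending changes in a single step (all or nothing). */
    void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(pending);
        pending.clear();
        committed = Collections.unmodifiableList(next);
    }

    /** Discards everything buffered since the last commit. */
    void rollback() {
        pending.clear();
    }

    /** A "reader" here is just a reference to the view current when it was opened. */
    List<String> openReader() {
        return committed;
    }

    public static void main(String[] args) {
        CommitVisibilitySketch writer = new CommitVisibilitySketch();
        writer.addDocument("doc1");
        writer.addDocument("doc2");
        List<String> before = writer.openReader(); // opened before the commit
        writer.commit();
        List<String> after = writer.openReader();  // opened after the commit
        System.out.println(before.size() + " then " + after.size()); // prints: 0 then 2
    }
}
```

The reader opened before the commit still sees zero documents afterwards, which is the isolation property in miniature.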
Lucene also exposes a two-phased commit API: call the prepareCommit method to do all of the hard work (applying buffered deletes, writing buffered documents, fsyncing files). If something is going to go wrong (e.g., the disk fills up) it'll almost certainly happen during this first phase. Then, call commit to complete the transaction.
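The two-phase pattern described above can be sketched as a small coordinator over several resources. This is a minimal illustration, not Lucene's API: the Resource interface, the Recording class and commitAll are hypothetical names, with Resource standing in for anything that offers prepare, commit and rollback operations (an index writer, a database connection). Failures in the second phase are only surfaced, not recovered from.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative two-phase commit coordinator; none of these types are Lucene classes. */
final class TwoPhaseCommitSketch {

    /** Hypothetical stand-in for a resource such as an index writer or a database. */
    interface Resource {
        void prepareCommit() throws Exception; // phase 1: the hard, failure-prone work
        void commit() throws Exception;        // phase 2: finish the prepared commit
        void rollback();                       // discard changes since the last commit
    }

    /**
     * Returns true if every resource committed. Returns false if phase 1 failed
     * anywhere, in which case every resource is rolled back.
     */
    static boolean commitAll(List<? extends Resource> resources) {
        try {
            for (Resource r : resources) {
                r.prepareCommit();
            }
        } catch (Exception e) {
            for (Resource r : resources) {
                r.rollback();
            }
            return false;
        }
        for (Resource r : resources) {
            try {
                r.commit();
            } catch (Exception e) {
                // Not handled in this sketch: some resources may already be committed.
                throw new IllegalStateException("phase 2 failed; recovery needed", e);
            }
        }
        return true;
    }

    /** In-memory resource that records the calls it receives; for demonstration only. */
    static final class Recording implements Resource {
        final List<String> calls = new ArrayList<>();
        private final boolean failPrepare;

        Recording(boolean failPrepare) {
            this.failPrepare = failPrepare;
        }

        @Override public void prepareCommit() throws Exception {
            calls.add("prepare");
            if (failPrepare) {
                throw new Exception("simulated: disk full");
            }
        }

        @Override public void commit() {
            calls.add("commit");
        }

        @Override public void rollback() {
            calls.add("rollback");
        }
    }

    public static void main(String[] args) {
        List<Recording> resources = new ArrayList<>();
        resources.add(new Recording(false));
        resources.add(new Recording(true)); // this one fails during phase 1
        System.out.println(commitAll(resources) + " " + resources.get(0).calls);
    }
}
```

Because nothing is committed until every resource has prepared successfully, a phase 1 failure leaves all resources at their previous commit.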
When you close the IndexWriter, it calls commit under the hood. If, instead, you want to discard all changes since the last commit, call the rollback method, which also closes the writer. You can even roll back a CREATE: if you have an existing index, and you open an IndexWriter on it with OpenMode.CREATE, and then rollback, the index will be unchanged. Likewise if you call deleteAll and then rollback.
Note that merely opening an IndexWriter on a new directory does not create an empty commit; i.e., you cannot open an IndexReader on the directory until you've called commit.
Lucene does not implement a transaction log itself, but it's easy to build that layer on top. For example, popular search servers such as Solr and ElasticSearch do so.
Multiple commits in one index
A single Lucene index is free to contain more than one commit; this is a powerful yet often overlooked feature. Each commit holds a point-in-time view of the index as it existed when the commit was created.
This is similar to the snapshots and writable clones available in modern filesystems like ZFS and the up-and-coming Btrfs. In fact, Lucene is able to efficiently expose multiple commits for the very same underlying reason: all index segments and files are write-once, just like the file blocks in ZFS and Btrfs.
To save multiple commits in your index, just implement your own IndexDeletionPolicy and pass it to IndexWriter. This is the class Lucene uses to know which commits should be deleted: IndexWriter invokes it on opening an index and whenever a commit succeeds. The default policy, KeepOnlyLastCommitDeletionPolicy, deletes all but the last commit. If you use NoDeletionPolicy then every commit is retained!
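To make the role of a deletion policy concrete, here is a small stdlib-only model. It does not use Lucene's actual classes: Commit, Policy, KEEP_ONLY_LAST and KEEP_ALL are hypothetical names that mimic the two policies mentioned above, and the file names are merely illustrative. The sketch assumes, as described earlier, that index files are write-once, so a file can be removed as soon as no retained commit references it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of how a deletion policy decides which index files may be removed. */
final class DeletionPolicySketch {

    /** Hypothetical commit point: a generation plus the write-once files it references. */
    static final class Commit {
        final long generation;
        final Set<String> files;

        Commit(long generation, String... files) {
            this.generation = generation;
            this.files = new LinkedHashSet<>(Arrays.asList(files));
        }
    }

    /** Hypothetical policy: given all commits, oldest first, return those to keep. */
    interface Policy {
        List<Commit> keep(List<Commit> commitsOldestFirst);
    }

    // Mimics the behavior the text attributes to KeepOnlyLastCommitDeletionPolicy.
    static final Policy KEEP_ONLY_LAST = commits ->
            commits.isEmpty() ? commits : commits.subList(commits.size() - 1, commits.size());

    // Mimics the behavior the text attributes to NoDeletionPolicy.
    static final Policy KEEP_ALL = commits -> commits;

    /** A file is deletable once no retained commit references it. */
    static Set<String> deletableFiles(List<Commit> commitsOldestFirst, Policy policy) {
        Set<String> live = new HashSet<>();
        for (Commit c : policy.keep(commitsOldestFirst)) {
            live.addAll(c.files);
        }
        Set<String> deletable = new TreeSet<>();
        for (Commit c : commitsOldestFirst) {
            for (String f : c.files) {
                if (!live.contains(f)) {
                    deletable.add(f);
                }
            }
        }
        return deletable;
    }

    /** Three commits; the third replaces segments _0 and _1 with a merged segment _2. */
    static List<Commit> exampleCommits() {
        return Arrays.asList(
                new Commit(1, "segments_1", "_0.cfs"),
                new Commit(2, "segments_2", "_0.cfs", "_1.cfs"),
                new Commit(3, "segments_3", "_2.cfs"));
    }

    public static void main(String[] args) {
        System.out.println(deletableFiles(exampleCommits(), KEEP_ONLY_LAST));
        System.out.println(deletableFiles(exampleCommits(), KEEP_ALL));
    }
}
```

With the keep-only-last policy every file of the two older commits becomes deletable, while the keep-all policy leaves every file in place so that each commit remains searchable.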
You can pass user data along with a commit, to record custom information (opaque to Lucene) about that commit, and then find all commits in the index. Once you've found a commit, you can open an IndexReader on it to search the index as of that commit.
You can also open an IndexWriter on a prior commit, to effectively roll back all changes after it: this is just like the rollback method, except it enables you to roll back across commits and not just the changes made in the current IndexWriter session.
Old commits are still kept even when you open an index with OpenMode.CREATE. It's also fine to do so while IndexReaders are still searching the old commits. This enables fun use cases, such as fully re-indexing your content between each commit without affecting any open readers.
Combining all of these fun transactional features, you can do some cool things:
- Hot backups, using SnapshotDeletionPolicy or PersistentSnapshotDeletionPolicy: these deletion policies make it trivial to take a "live" backup of the index without blocking ongoing changes with IndexWriter. The backup can easily be incremental (just copy the new files, remove the deleted ones), and you can freely throttle the IO to minimize any interference with searching.
- Searching different catalog versions: perhaps you run an e-commerce site, but you ship multiple
versions of your catalog. In this case you can keep older commits
around, each searching a specific version of your catalog, enabling
users to choose which catalog to search.
- Repeatable indexing tests from the same initial index: maybe you want to run a bunch of performance tests, perhaps trying different RAM buffer sizes or merge factors, starting from a large initial index. To do this, simply run each test, but in the end, instead of closing the IndexWriter, use the rollback method to quickly return the index to its initial state, ready for the next test.
- Force all index segments to be merged down to a single segment, but
also keep the prior multi-segment commit. Then you can do
tests to compare multi-segment vs single-segment performance.
- Indexing and searching over the NFS file system: because NFS does not protect still-open files from deletion, you must use an IndexDeletionPolicy to keep each commit around until all open readers have finished with the commit (i.e., reopened to a newer commit). The simple approach is time-based, for example: don't delete the commit until it is 15 minutes old, and then always reopen your readers every 5 minutes. Without this you'll hit all sorts of scary exceptions when searching over NFS.
- Distributed commit: if you have other resources that must commit atomically along with the changes to your Lucene index, you can use the two-phased commit API. This is simple, but vulnerable to failures during the 2nd phase; to also recover from such cases, for example if Lucene completed its 2nd phase commit but the database's 2nd phase hit some error or crash or power loss, you can easily roll back Lucene's commit by opening an IndexWriter on the prior commit.
- Experimental index changes: maybe you want to try re-indexing some
subset of your index in a new way, but you're not sure it'll work
out. In this case, just keep the old commit around, and then
rollback if it didn't work out, or delete the old commit if it did.
- Time-based snapshots: maybe you'd like the freedom to roll back to your index as it existed 1 day ago, 1 week ago, 1 month ago, etc., so you preserve commits based on their age.
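The time-based rule from the NFS item above (keep each commit until it reaches a minimum age) comes down to a small decision function. The sketch below shows only that decision logic, under stated assumptions: TimeBasedRetentionSketch and expired are hypothetical names, commits are represented solely by their creation times, and the newest commit is never eligible for deletion. A real implementation would apply the same test inside an IndexDeletionPolicy.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Decision logic for an age-based commit retention rule; not a Lucene class. */
final class TimeBasedRetentionSketch {

    /**
     * Returns the creation times of commits that may be deleted: every commit
     * except the newest whose age is at least minAge.
     */
    static List<Instant> expired(List<Instant> commitTimesOldestFirst, Instant now, Duration minAge) {
        List<Instant> result = new ArrayList<>();
        // Stop before the last entry: the newest commit is always retained.
        for (int i = 0; i < commitTimesOldestFirst.size() - 1; i++) {
            Instant created = commitTimesOldestFirst.get(i);
            if (Duration.between(created, now).compareTo(minAge) >= 0) {
                result.add(created);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-01T12:00:00Z");
        List<Instant> commits = Arrays.asList(
                Instant.parse("2024-01-01T11:30:00Z"),  // 30 minutes old
                Instant.parse("2024-01-01T11:50:00Z"),  // 10 minutes old
                Instant.parse("2024-01-01T11:58:00Z")); // newest, always kept
        // Using the 15 minute threshold from the NFS example.
        System.out.println(expired(commits, now, Duration.ofMinutes(15)));
    }
}
```

A stricter variant would measure age from the moment a commit was superseded by a newer one rather than from its creation, since a reader may open a commit at any point while it is still the latest. The time-based snapshot idea uses the same ingredients with a different retention test, for example keeping one commit per day, week or month.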