Changing Bits: Testing Lucene's index durability after crash or power loss

One of Lucene's useful transactional features is index durability which ensures that, once you successfully call IndexWriter.commit, even if the OS or JVM crashes or power is lost, or you kill -KILL your JVM process, after rebooting, the index will be intact (not corrupt) and will reflect the last successful commit before the crash.

Of course, this only works if your hardware is healthy and your IO devices implement fsync properly (flush their write caches when asked by the OS). If you have data-loss issues, such as a silent bit-flipper in your memory, IO or CPU paths, thanks to the new end-to-end checksum feature (LUCENE-2446), available as of Lucene 4.8.0, Lucene will now detect that as well during indexing or CheckIndex. This is similar to the ZFS file system's block-level checksums, but not everyone uses ZFS yet (heh), and so Lucene now does its own checksum verification on top of the file system.

Be sure to enable checksum verification during merge by calling IndexWriterConfig.setCheckIntegrityAtMerge. In the future we'd like to remove that option and always validate checksums on merge, and we've already done so for the default stored fields format in LUCENE-5580 and (soon) term vectors format in LUCENE-5602, as well as set up the low-level IO APIs so other codec components can do so as well, with LUCENE-5583, for Lucene 4.8.0.

FileDescriptor.sync and fsync

Under the hood, when you call IndexWriter.commit, Lucene gathers up all newly written filenames since the last commit, and invokes FileDescriptor.sync on each one to ensure all changes are moved to stable storage.

At its heart, fsync is a complex operation, as the OS must flush any dirty pages associated with the specified file from its IO buffer cache, work with the underlying IO device(s) to ensure their write caches are also flushed, and also work with the file system to ensure its integrity is preserved. You can separately fsync the bytes or metadata for a file, and also the directory(ies) containing the file. This blog post is a good description of the challenges.

Recently we've been scrutinizing these parts of Lucene, and all this attention has uncovered some exciting issues!

In LUCENE-5570, to be fixed in Lucene 4.7.2, we discovered that the fsync implementation in our FSDirectory implementations is able to bring new 0-byte files into existence. This normally isn't a problem by itself, because IndexWriter shouldn't fsync a file that it didn't create. However, it exacerbates debugging when there is a bug in IndexWriter or in the application using Lucene (e.g., directly deleting index files that it shouldn't). In these cases it's confusing to discover these 0-byte files so much later, versus hitting a FileNotFoundException at the point when IndexWriter tried to fsync them.

In LUCENE-5588, to be fixed in Lucene 4.8.0, we realized we must also fsync the directory holding the index, otherwise it's possible on an OS crash or power loss that the directory won't link to the newly created files or that you won't be able to find your file by its name. This is clearly important because Lucene lists the directory to locate all the commit points (segments_N files), and of course also opens files by their names.

Since Lucene does not rely on file metadata like access time and modify time, it is tempting to use fdatasync (or FileChannel.force(false) from java) to fsync just the file's bytes. However, this is an optimization and at this point we're focusing on bugs. Furthermore, it's likely this won't be any faster since the metadata must still be sync'd by fdatasync if the file length has changed, which is always the case in Lucene since we only append to files when writing (we removed Indexoutput.seek in LUCENE-4399).

In LUCENE-5574, to be fixed as of Lucene 4.7.2, we found that a near-real-time reader, on closing, could delete files even if the writer it was opened from has been closed. This is normally not a problem by itself, because Lucene is write-once (never writes to the same file name more than once), as long as you use Lucene's APIs and don't modify the index files yourself. However, if you implement your own index replication by copying files into the index, and if you don't first close your near-real-time readers, then it is possible closing them would remove the files you had just copied.

During any given indexing session, Lucene writes many files and closes them, many files are deleted after being merged, etc., and only later, when the application finally calls IndexWriter.commit, will IndexWriter then re-open the newly created files in order to obtain a FileDescriptor so we can fsync them.

This approach (closing the original file, and then opening it again later in order to sync), versus never closing the original file and syncing that same file handle you used for writing, is perhaps risky: the javadocs for FileDescriptor.sync are somewhat vague as to whether this approach is safe. However, when we check the documentation for fsync on Unix/Posix and FlushFileBuffers on Windows, they make it clear that this practice is fine, in that the open file descriptor is really only necessary to identify which file's buffers need to be sync'd. It's also hard to imagine an OS that would separately track which open file descriptors had made which changes to the file. Nevertheless, out of paranoia or an abundance of caution, we are also exploring a possible patch on LUCENE-3237 to fsync only the originally opened files.

Testing that fsync really works

With all these complex layers in between your application's call to IndexWriter.commit and the laws of physics ensuring little magnets were flipped or a few electrons were moved into a tiny floating gate in a NAND cell, how can we reliably test that the whole series of abstractions is actually working?

In Lucene's randomized testing framework we have a nice evil Directory implementation called MockDirectoryWrapper. It can do all sorts of nasty things like throw random exceptions, sometimes slow down opening, closing and writing of some files, refuse to delete still-open files (like Windows), refuse to close when there are still open files, etc. This has helped us find all sorts of fun bugs over time.

Another thing it does on close is to simulate an OS crash or power loss by randomly corrupting any un-sycn'd files and then confirming the index is not corrupt. This is useful for catching Lucene bugs where we are failing to call fsync when we should, but it won't catch bugs in our implementation of sync in our FSDirectory classes, such as the frustrating LUCENE-3418 (first appeared in Lucene 3.1 and finally fixed in Lucene 3.4).

So, to catch such bugs, I've created a basic test setup, making use of a simple Insteon on/off device, along with custom Python bindings I created long ago to interact with Insteon devices. I already use these devices all over my home for controlling lights and appliances, so also using this for Lucene is a nice intersection of two of my passions!

The script loops forever, first updating the sources, compiling, checking the index for corruption, then kicking off an indexing run with some randomization in the settings, and finally, waiting a few minutes and then cutting power to the box. Then, it restores power, waits for the machine to be responsive again, and starts again.

So far it's done 80 power cycles and no corruption yet. Good news!

To "test the tester", I tried temporarily changing fsync to do nothing, and indeed after a couple iterations, the index became corrupt. So indeed the test setup seems to "work".

Currently the test uses Linux on a spinning magnets hard drive with the ext4 file system. This is just a start, but it's better than no proper testing for Lucene's fsync. Over time I hope to test different combinations of OS's, file systems, IO hardware, etc.

Changing Bits

Friday, April 11, 2014

Testing Lucene's index durability after crash or power loss

2 comments: