While indexing, Lucene periodically merges multiple segments in the index into a single larger segment. This keeps the number of segments relatively contained (important for search performance), and also reclaims disk space for any deleted docs on those segments.
However, it has a well-known problem: the merging process evicts pages from the OS's buffer cache. The eviction is ~2X the size of the merge, or ~3X if you are using the compound file format.
If the machine is dedicated to indexing, this usually isn't a problem; but on a machine that's also searching, this can be catastrophic as it "unwarms" your warmed reader. Users will suddenly experience long delays when searching. And because a large merge can take hours, this can mean hours of suddenly poor search performance.
So why hasn't this known issue been fixed yet? Because Java, unfortunately, does not expose access to the low-level APIs (posix_fadvise, posix_madvise) that would let us fix this. It's not even clear whether NIO.2 (in Java 7) will expose these.
On the Lucene dev list we've long assumed that these OS-level functions should fix the issue, if only we could access them.
So I decided to make a quick and dirty test to confirm this, using a small JNI extension.
I created a big-ish (~7.7G) multi-segment Wikipedia index, and then ran a set of ~2900 queries against this index, over and over, letting it warm up the buffer cache. Looking at /proc/meminfo (on Linux) I can see that the queries require ~1.4GB of hot RAM in the buffer cache (this is a CentOS Linux box with 3G RAM; the index is on a "normal" SATA hard drive). Finally, in a separate JRE, I opened an IndexWriter and called optimize on the index.
I ran this on trunk (4.0-dev), first, and confirmed that after a short while, the search performance indeed plummets (by a factor of ~35), as expected. RAM is much faster than hard drives!
Next, I modified Lucene to call posix_fadvise with the NOREUSE flag; from the man page, this flag looks perfect:
Specifies that the application expects to access the specified data once and then not reuse it thereafter.
I re-ran the test and... nothing changed! Exactly the same slowdown. So I did some digging, and found Linux's source code for posix_fadvise. If you look closely you'll see that the NOREUSE case is a no-op! Ugh.
This is really quite awful. Besides Lucene, I can imagine a number of other apps that really should use this flag. For example, when mencoder slowly reads a 50 GB bluray movie, and writes a 5 GB H.264 file, you don't want any of those bytes to pollute your buffer cache. Same thing for rsync, backup programs, software up-to-date checkers, desktop search tools, etc. Of all the flags, this one seems like the most important to get right! It's possible other OSs do the right thing; I haven't tested.
So what to do?
One approach is to forcefully free the pages, using the DONTNEED flag. This will drop the specified pages from the buffer cache. But there's a serious problem: the search process is using some pages in these files! So you must drop only those pages that the merge process, alone, had pulled in. You can use the mincore function to query which pages are already cached, so you know which ones not to drop. A neat patch for rsync took exactly this approach. The problem is that mincore provides only a snapshot, so you'd have to call it many times while merging to minimize discarding pages that had recently been cached for searching.
We should not have to resort to such silly hacks!
Another approach is to switch to memory-mapped IO, using Lucene's MMapDirectory, and then use madvise. The SEQUENTIAL option looks promising from the man page:
Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)
Looking through the Linux sources, it looks like the SEQUENTIAL option is at least not a no-op: that setting has some influence over how pages are evicted.
So I tested that, but, alas, the search performance still plummets. No go!
Yet another approach is to bypass all OS caching entirely, only when merging, by using the Linux-specific O_DIRECT flag. Merge performance will definitely take a hit, since the OS is no longer doing readahead nor write caching, and every single IO request must hit the disk while you wait, but for many apps this could be a good tradeoff.
So I created a prototype Directory implementation, a variant of DirectNIOFSDirectory (currently a patch on LUCENE-2056), that opened all files (input and output) with O_DIRECT (using JNI). It's a little messy because all IO must be "aligned" by certain rules (I followed the rules for 2.6.* kernels).
Finally, this solution worked great! Search performance was unchanged all through the optimize call, including building the CFS at the end. Third time's a charm!
However, the optimize call slowed down from 1336 to 1680 seconds (26% slower). This could likely be reduced by further increasing the buffer sizes (I used a 1 MB buffer for each IndexInput and IndexOutput, which is already large), or possibly by creating our own readahead / write cache scheme.
We really should not have to resort to such silly hacks!