Monday, June 14, 2010

Lucene and fadvise/madvise

While indexing, Lucene periodically merges multiple segments in the index into a single larger segment. This keeps the number of segments relatively contained (important for search performance), and also reclaims disk space for any deleted docs on those segments.

However, it has a well-known problem: the merging process evicts pages from the OS's buffer cache. The eviction is ~2X the size of the merge, or ~3X if you are using the compound file format.

If the machine is dedicated to indexing, this usually isn't a problem; but on a machine that's also searching, this can be catastrophic as it "unwarms" your warmed reader. Users will suddenly experience long delays when searching. And because a large merge can take hours, this can mean hours of suddenly poor search performance.

So why hasn't this known issue been fixed yet? Because Java, unfortunately, does not expose access to the low-level APIs (posix_fadvise, posix_madvise) that would let us fix this. It's not even clear whether NIO.2 (in Java 7) will expose these.

On the Lucene dev list we've long assumed that these OS-level functions should fix the issue, if only we could access them.

So I decided to make a quick and dirty test to confirm this, using a small JNI extension.

I created a big-ish (~7.7G) multi-segment Wikipedia index, and then ran a set of ~2900 queries against this index, over and over, letting it warm up the buffer cache. Looking at /proc/meminfo (on Linux) I can see that the queries require ~1.4GB of hot RAM in the buffer cache (this is a CentOS Linux box with 3G RAM; the index is on a "normal" SATA hard drive). Finally, in a separate JRE, I opened an IndexWriter and called optimize on the index.

I ran this on trunk (4.0-dev), first, and confirmed that after a short while, the search performance indeed plummets (by a factor of ~35), as expected. RAM is much faster than hard drives!

Next, I modified Lucene to call posix_fadvise with the NOREUSE flag; from the man page, this flag looks perfect:

Specifies that the application expects to access the specified data once and then not reuse it thereafter.
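
The native side of that JNI hook is tiny. Roughly, it looks like the sketch below (the helper name is mine, not the actual patch; Lucene would call it after each chunk the merge reads or writes):

    /* Minimal sketch: advise the kernel that a byte range we just read or
     * wrote will not be reused.  Hypothetical helper, not the real patch. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>

    int advise_noreuse(int fd, off_t offset, off_t len) {
        /* "Access once, then don't reuse" -- exactly what a merge does. */
        int rc = posix_fadvise(fd, offset, len, POSIX_FADV_NOREUSE);
        if (rc != 0) {
            fprintf(stderr, "posix_fadvise(NOREUSE) failed: %d\n", rc);
        }
        return rc;
    }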

I re-ran the test and.... nothing changed! Exactly the same slowdown. So I did some digging, and found Linux's source code for posix_fadvise. If you look closely you'll see that NOREUSE is a no-op! Ugh.

This is really quite awful. Besides Lucene, I can imagine a number of other apps that really should use this flag. For example, when mencoder slowly reads a 50 GB Blu-ray movie and writes a 5 GB H.264 file, you don't want any of those bytes to pollute your buffer cache. Same thing for rsync, backup programs, software update checkers, desktop search tools, etc. Of all the flags, this one seems like the most important to get right! It's possible other OSs do the right thing; I haven't tested.

So what to do?

One approach is to forcefully free the pages, using the DONTNEED flag. This will drop the specified pages from the buffer cache. But there's a serious problem: the search process is using certain pages in these files! So you must only drop those pages that the merge process, alone, had pulled in. You can use the mincore function, to query for those pages that are already cached, so you know which ones not to drop. A neat patch for rsync took exactly this approach. The problem with this is mincore provides only a snapshot, so you'd have to call it many times while merging to try to minimize discarding pages that had been recently cached for searching.
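
In native code, that approach looks roughly like the sketch below (my own illustration of the rsync-style trick, not something Lucene ships: it snapshots residency once before the big read, then drops only the pages that were cold at that point; a real merge would need to re-snapshot periodically, and dirty pages would need flushing before DONTNEED takes effect):

    /* Sketch: remember which pages of a file were already hot before a big
     * sequential read, and afterwards evict only the pages we pulled in. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns one byte per page; bit 0 set means the page was resident. */
    unsigned char *snapshot_resident(int fd, size_t file_len) {
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (file_len + page - 1) / page;
        unsigned char *vec = malloc(npages);
        if (vec == NULL) return NULL;
        void *map = mmap(NULL, file_len, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { free(vec); return NULL; }
        mincore(map, file_len, vec);
        munmap(map, file_len);
        return vec;
    }

    /* Drop only the pages that were NOT cached before the merge started. */
    void drop_cold_pages(int fd, size_t file_len, const unsigned char *before) {
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (file_len + page - 1) / page;
        for (size_t i = 0; i < npages; i++) {
            if (!(before[i] & 1)) {
                posix_fadvise(fd, (off_t) (i * page), page, POSIX_FADV_DONTNEED);
            }
        }
    }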

We should not have to resort to such silly hacks!

Another approach is to switch to memory-mapped IO, using Lucene's MMapDirectory, and then use madvise. The SEQUENTIAL option looks promising from the man page:

Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)

Looking through the Linux sources, it looks like the SEQUENTIAL option is at least not a no-op; the setting has some influence over how pages are evicted.
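
The mmap side of that experiment boils down to something like this sketch (the function is illustrative; error handling is mostly omitted):

    /* Sketch: map a segment file for a merge and hint sequential access. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void *map_for_merge(const char *path, size_t *len_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return NULL;
        struct stat st;
        fstat(fd, &st);
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);  /* the mapping stays valid after close */
        if (map == MAP_FAILED) return NULL;
        /* Pages will be touched in order; the kernel may read ahead
         * aggressively and free them soon after they are accessed. */
        madvise(map, st.st_size, MADV_SEQUENTIAL);
        *len_out = (size_t) st.st_size;
        return map;
    }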

So I tested that, but, alas, the search performance still plummets. No go!

Yet another approach is to bypass all OS caching entirely, only when merging, by using the Linux-specific O_DIRECT flag. Merge performance will definitely take a hit, since the OS is no longer doing readahead or write caching, and every single IO request must hit the disk while you wait, but for many apps this could be a good tradeoff.

So I created a prototype Directory implementation, a variant of DirectNIOFSDirectory (currently a patch on LUCENE-2056), that opened all files (input and output) with O_DIRECT (using JNI). It's a little messy because all IO must be "aligned" by certain rules (I followed the rules for 2.6.* kernels).
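
Stripped of the Directory plumbing, the O_DIRECT side reduces to something like the sketch below (the 512-byte alignment is an assumption that happens to match many devices; a real implementation should discover the required alignment and reuse the aligned buffer across calls):

    /* Sketch: one aligned read through O_DIRECT, bypassing the buffer cache. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGN 512  /* assumed alignment for buffer, offset and length */

    ssize_t direct_read(const char *path, off_t offset, size_t len, void **out) {
        if ((offset % ALIGN) != 0 || (len % ALIGN) != 0) return -1;
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) return -1;
        void *buf = NULL;
        if (posix_memalign(&buf, ALIGN, len) != 0) { close(fd); return -1; }
        /* Every request goes straight to the device: no readahead, no caching. */
        ssize_t n = pread(fd, buf, len, offset);
        close(fd);
        *out = buf;  /* caller frees */
        return n;
    }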

Finally, this solution worked great! Search performance was unchanged all through the optimize call, including building the CFS at the end. Third time's a charm!

However, the optimize call slowed down from 1336 to 1680 seconds (26% slower). This could likely be reduced by further increasing the buffer sizes (I used a 1 MB buffer for each IndexInput and IndexOutput, which is already large), or possibly by creating our own readahead / write cache scheme.

We really should not have to resort to such silly hacks!

20 comments:

  1. Hi Mike:

    Was wondering if you did any testing with the MergePolicy we contributed.

    Thanks

    -John

  2. No, I just used Lucene's default MergePolicy; BalancedSegmentMergePolicy wouldn't have changed anything since my test simply calls IndexWriter.optimize(). I wanted a merge that was large enough to force eviction of the IO cache; optimize did the trick (it burns through 23.1G (= 7.7G * 3) bytes).

  3. did you try using mmap to map the index for searching and standard file io for merging? the index would then be cached with the rest of the process' address space and would not be evicted by sequentially scanning the file using fread and fadvise with FADV_SEQUENTIAL. you may need to play with the swappiness kernel parameter to tune performance.

    i think MADV_SEQUENTIAL actively pages out memory after you read further into the file whether that page was previously mapped or not, so that's why using it killed your performance.

  4. I did not try mmap for searching and normal file IO for merging... but I imagine that'll still evict hot searching pages. FADV_SEQUENTIAL looks like it's only about increasing the heuristic readahead that the kernel does, and not about quickly evicting the pages you've read.

    Really MADV/FADV_SEQUENTIAL need to somehow mark a given page for aggressive eviction only if that page was not already resident due to a non-sequential/no-reuse read. Or perhaps maintain two buffer caches: one "transient" one for sequential/noreuse-only reads (which'd in general be tiny), and a second one for "normal" (random-like) file/mmap io.

    Or... O_DIRECT is actually an OK match, except, I'd still want the OS to somehow do the readahead and write caching.

    I had already set swappiness to 0 for all of these tests, so OS would make no effort to swap "idle" process pages out.

  5. Is your O_DIRECT code posted anywhere?

  6. Jonathan: yes.

    It's been committed to Lucene's contrib/misc package, under org.apache.lucene.store, DirectIOLinuxDirectory. This is on both Lucene stable (to be 3.1) and trunk (to be 4.0).

  7. Does this mean that you use direct IO for the reader as well (that is, for queries)?

    If so, I would expect search performance to drop overall, especially if you have a lot of cache. You lose the OS's ability to cache, group and order I/O operations across multiple threads and processes, and this tends to slow things down on newer machines (where memory bandwidth is rarely a big problem vs. I/O, unlike in the old days when direct IO was originally made).

  8. Terje: no, I would definitely NOT use O_DIRECT for searching (performance would be horrible). It should only be used for reading & writing during indexing.

    This is easy to do in Lucene: when you create your IndexWriter, use DirectIOLinuxDirectory; when you create your IndexReader/Searcher, use one of the other core Directory implementations (NIOFSDirectory, MMapDirectory).

    During searching we very much want to tap into the OS's buffer cache, we want the OS to do readahead for us, etc. (However, we also want the OS to never swap out the data structures we carefully loaded in RAM; this is why I always set swappiness to 0 when I'm on Linux).

    Really we need an O_DIRECT-like option that still does write caching and read-ahead (but quickly drops pages from the buffer cache). Either that, or the kernel should actually respect the NOREUSE flag (it's a no-op now!).

  9. Just thinking about something now... I read somewhere a long time ago that ioprio_set() can be applied to threads, and not just process ids as the man page says. I have only used it on whole processes myself, so I cannot confirm this.

    A search on Google brings up a few postings which seem to confirm this as well.

    Assuming that the optimize routine in Lucene runs in a separate thread (and it is possible to get the right ID for that thread in Java), have you considered applying that?

    As I said, I have only used this on separate processes. It does tend to help, though not hugely on the cache problem. Anyway, I guess search responsiveness is what matters here, so anything that can be added to reduce the effect on search helps?

    Be aware that you need a Linux kernel I/O scheduler which supports I/O priorities as well. This may be a problem in itself, as I think it is still just the CFQ scheduler which supports this, and it is not always the best choice....

  10. Terje: nice! I did not know ioprio_get/set could operate per thread.

    I haven't done any testing with IO prioritization, but I'd love to. Lucene could very much use this, to decrease priority of IO required for merging.

    This is nicely orthogonal to not polluting the buffer cache. The combination of the two should mean that merging has very little impact on ongoing searches, unlike today, where users have noted sometimes catastrophic performance loss due to ongoing merging.

    The problem is... IO prioritization is in its infancy. Really all OS's should support it well, but (I think?) it's only Linux, and even then only if you use the CFQ IO scheduler. OS/IO is still in the dark ages!!

    Note that we could emulate this in Lucene. It's hackish, but, a Directory implementation could track when open IndexInputs are being used for searching and forcefully pause, or rate limit to some low floor, any ongoing merge-only IO. Such an implementation could work well in practice...
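
    For reference, the per-thread trick Terje mentions would look roughly like this from native code (a sketch; glibc has no ioprio_set wrapper, so it goes through syscall(), the constants are copied from the kernel headers, and it only has an effect under CFQ):

        /* Sketch: a merge thread lowering its own IO priority on Linux. */
        #define _GNU_SOURCE
        #include <sys/syscall.h>
        #include <unistd.h>

        #define IOPRIO_WHO_PROCESS 1   /* "who" is a process or thread id; 0 = caller */
        #define IOPRIO_CLASS_IDLE  3   /* only get IO when the disk is otherwise idle */
        #define IOPRIO_CLASS_SHIFT 13
        #define IOPRIO_PRIO_VALUE(cls, data) (((cls) << IOPRIO_CLASS_SHIFT) | (data))

        int lower_my_io_priority(void) {
            return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                           IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
        }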

  11. It should be noted that CFQ will also reduce I/O priority for processes based on their nice (CPU) priority, but ioprio_set does add more control.

    I believe several OSes have I/O priorities, and this is one place where Linux is lagging the commercial OSes quite a bit, but a problem is that it is not standardized and rarely used by applications.

    It was a while ago and my memory is fuzzy, and it is not exactly I/O priorities, but GRIO (Guaranteed Rate I/O) was one of the big differentiators claimed by SGI for XFS 15+ years ago. Since I never had much interest in multimedia-type processing and the original GRIO stuff was not all that useful on busy multiuser setups, I did not pay much attention to this stuff back then.

    Being a bit lazy today, I cannot find a clear answer on whether GRIO is supported on Linux XFS, so I have no idea there.

    It also seems like the I/O priority support introduced starting with Windows Vista works reasonably well. Vista also seems relatively good at detecting streaming-type I/O and not flushing caches or swapping out processes to free memory for caching.

    I used to process a lot of data on my home computer around the time Vista was introduced, and it behaved way better than XP and a fair bit better than standard Linux distributions when churning through large amounts of data (given enough memory for Vista to perform well in the first place).

    It would be interesting to know how Vista/7 handles Lucene index optimizing.

  12. Very interesting to read about your experiments.

    The MADV_SEQUENTIAL behavior has gotten smarter in recent kernels; I wonder if your mmap+madvise()-based test results would improve.

    Prior to 2.6.29, MADV_SEQUENTIAL did nothing but increase the kernel's readahead window. In 2.6.29, the following patch was merged:
    http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=4917e5d0499b5ae7b26b56fccaefddf9aec9369c
    AFAICS this would implement the behavior you'd want for the Lucene merger.

  13. Hi Brian,

    That patch looks great! So now Linux will prefer to evict pages loaded due to SEQUENTIAL reading, and it looks like if the page is also accessed through non-SEQUENTIAL reads, it's not aggressively evicted. This does sound great for our use case....

    Though... really, it'd be better if this logic were under the NOREUSE flag instead, I think. Ie, Lucene should pass SEQUENTIAL when reading the postings, since we seek to one spot and then read potentially many bytes from there. Yet really we want the OS to keep such pages in the buffer cache; ie for "hot" (frequently searched) terms, caching would help us.

    This is also good timing: we have a potential GSoC project (http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/varunthacker1989/1) that will make it possible for native dir impls to customize the IO flags depending on whether the file is for merging, searching, etc. Maybe we can test this better SEQUENTIAL behavior with this project...

  14. Hi,
    Since your post, things have been "evolving" in the Linux kernel regarding proper support of POSIX_FADV_NOREUSE for fadvise:

    http://thread.gmane.org/gmane.linux.kernel.mm/73853

    What about retesting with a kernel where this patch is obviously included, while using NOREUSE (pun intended)?

  15. Hi Mark,

    That's very interesting! I would like to re-benchmark. But I can't tell whether anything was committed / which kernel releases have this improvement?

    ReplyDelete
  16. Hi, what about JNA to enable the O_DIRECT flag? I wrote a couple of classes, available on my blog.

    Replies
    1. Hi Matteo,

      JNA is a good idea! Next time I need to play with native stuff I'll try it. Thanks.

  17. This comment has been removed by the author.

  18. Hello Michael

    I am using lucene 4.X on CentOS 6.
    I have an application which deals with tons of data (I receive half a million documents per week, and keep them for 90 days). My usual index size is around 35 to 40 GB. I use NIODirectory,
    and all indexes are on SSDs. I use TieredMergePolicy with all default settings, and ConcurrentMergeScheduler with default settings.

    I see very suspicious behaviour in my application regarding search and indexing.

    When I start the application for the first time, with an index (~20 GB) or without an index, it works fine for around 10 days or so:
    - it returns millions of results in search,
    - indexing is good at 40 documents/sec.

    But suddenly after 10 days or so it gets slower (searching and indexing; queries take 5x more time). I profiled the application in that state and I see ~50% CPU utilized mainly by merge threads (short lived), with no thread leaks. Now the interesting part: when I undeploy the application, CPU usage drops to 20% and there are no threads related to Lucene. When I deploy the application again and try to search, searching/indexing is still slow. So undeploying didn't help.

    But after that, I restart the server and the application runs fine even with a huge index. This behaviour remains the same; every 10 days or so I see this situation.

    Is there anything in Lucene which is bound to the application server process? Only after a restart does the situation get better; undeploying doesn't work for me.

    Replies
    1. That's really quite strange. Maybe turn on IndexWriter's infoStream (IndexWriterConfig.setInfoStream)? It could shed some light on why you have so much merging happening. Also try posting to java-user@lucene.apache.org with the details?
