Monday, July 5, 2010

Lucene's RAM usage for searching

For fast searching, Lucene loads certain data structures entirely into RAM:

  • The terms dict index requires substantial RAM per indexed term (by default, every 128th unique term), and is loaded when the IndexReader is created. This can add up to a very large amount of RAM for indexes with an unusually high number of unique terms; to reduce it, you can pass a terms index divisor when opening the reader. For example, passing 2, which loads only every other indexed term, halves the RAM required. The tradeoff is that seeking to a given term, which is required once for every TermQuery, becomes slower, since Lucene must do twice as much scanning (on average) to find the term.

  • The field cache, which is used under the hood when you sort by a field, takes some amount of RAM per document, depending on the field type (String is by far the worst). It is loaded the first time you sort on that field.

  • Norms, which encode the a-priori document boost computed at indexing time, including length normalization and any boosting the app does, consume 1 byte per document for each field used for searching. For example, if your app searches 3 different fields, such as body, title and abstract, then norms require 3 bytes of RAM per document. Each field's norms are loaded on demand the first time that field is searched.

  • Deletions, if present, consume 1 bit per document; this structure is created during IndexReader construction.
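
The divisor tradeoff in the first bullet can be illustrated with a toy model. The sketch below is not Lucene's implementation: the class TermsIndexSketch, its method names, and the synthetic terms are all made up for illustration. It keeps every (128 times divisor)-th term of a sorted list in RAM, binary-searches that sample, and then scans forward through the full list, counting how many terms it had to step over.

```java
import java.util.Arrays;
import java.util.Locale;

/** Toy model of a sampled terms index. Illustration only, not Lucene's code. */
final class TermsIndexSketch {
    /** Default sampling interval mentioned in the post: every 128th term. */
    static final int INDEX_INTERVAL = 128;

    private final String[] allTerms;   // stands in for the full, sorted terms dict
    private final String[] indexTerms; // sampled terms, the part held in RAM
    private final int gap;             // number of terms between two sampled entries

    TermsIndexSketch(String[] sortedTerms, int divisor) {
        this.allTerms = sortedTerms;
        this.gap = INDEX_INTERVAL * divisor;
        int entries = (sortedTerms.length + gap - 1) / gap; // ceiling division
        this.indexTerms = new String[entries];
        for (int i = 0; i < entries; i++) {
            indexTerms[i] = sortedTerms[i * gap];
        }
    }

    /** Number of sampled terms this model keeps in RAM. */
    int entriesInRam() {
        return indexTerms.length;
    }

    /** Returns how many terms were stepped over to reach target, or -1 if absent. */
    int scanCost(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        if (pos < 0) {
            pos = -pos - 2; // last sampled entry that sorts before target
        }
        if (pos < 0) {
            return -1;      // target sorts before the first term
        }
        int start = pos * gap;
        int end = Math.min(start + gap, allTerms.length);
        for (int i = start; i < end; i++) {
            int cmp = allTerms[i].compareTo(target);
            if (cmp == 0) {
                return i - start;
            }
            if (cmp > 0) {
                break;      // passed the place where target would be
            }
        }
        return -1;
    }

    /** Average scan cost over one lookup of every term in the dict. */
    double averageScan() {
        long total = 0;
        for (String term : allTerms) {
            total += scanCost(term);
        }
        return (double) total / allTerms.length;
    }

    /** Builds zero-padded synthetic terms, so string order matches numeric order. */
    static String[] makeTerms(int count) {
        String[] terms = new String[count];
        for (int i = 0; i < count; i++) {
            terms[i] = String.format(Locale.ROOT, "term%08d", i);
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] terms = makeTerms(25_600); // hypothetical, tiny "index"
        for (int divisor : new int[] {1, 2}) {
            TermsIndexSketch index = new TermsIndexSketch(terms, divisor);
            System.out.printf(Locale.ROOT, "divisor=%d entries=%d avgScan=%.1f%n",
                    divisor, index.entriesInRam(), index.averageScan());
        }
    }
}
```

In this model, going from a divisor of 1 to a divisor of 2 halves the number of entries held in RAM and roughly doubles the average scan length, which is the tradeoff described above.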

Warming a reader is necessary because of the data structures that are initialized lazily (norms, FieldCache). It's also useful to pre-populate the OS's IO cache with those pages that cover the frequent terms you're searching on.

With flexible indexing, available in Lucene's trunk (4.0-dev), we've made great progress on reducing the RAM required for both the terms dict index and the String index field cache (some details here). We have substantially reduced the number of objects created for these RAM-resident data structures, and switched to representing all character data as UTF-8, not Java's char, which halves the RAM required when the character data is simple ASCII.

So, I ran a quick check against a real index, created from the first 5 million documents from the Wikipedia database export. The index has a single segment with no deletions. I initialize a searcher, then load norms for the body field and populate the FieldCache for sorting by the title field, using a 64-bit JRE 1.6:

  • 3.1-dev requires 674 MB of RAM

  • 4.0-dev requires 179 MB of RAM

That's a 73% reduction in RAM required!
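
As a back-of-the-envelope check on these numbers, the sketch below redoes the arithmetic using only figures from this post: 1 byte of norms per searched field per document, 1 bit per document for deletions, and the two measured totals. The class RamEstimateSketch and its method names are hypothetical, rounding deletions up to whole bytes is an assumption, and the estimates ignore any per-object overhead.

```java
/** Rough RAM arithmetic based on the per-document costs stated in the post. */
final class RamEstimateSketch {
    /** Norms: 1 byte per document for each searched field. */
    static long normsBytes(long docCount, int searchedFields) {
        return docCount * searchedFields;
    }

    /** Deletions: 1 bit per document, rounded up to whole bytes (an assumption). */
    static long deletionsBytes(long docCount) {
        return (docCount + 7) / 8;
    }

    /** Percentage reduction when going from before to after. */
    static double reductionPercent(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        long docs = 5_000_000L; // size of the Wikipedia test index in the post
        System.out.println("norms, 1 field (bytes): " + normsBytes(docs, 1));
        System.out.println("norms, 3 fields (bytes): " + normsBytes(docs, 3));
        System.out.println("deletions, if present (bytes): " + deletionsBytes(docs));
        System.out.println("reduction (%): " + reductionPercent(674, 179));
    }
}
```

For the 5 million document index, norms for the single body field come to about 5 MB, so under these assumptions norms are only a small part of either total, and going from 674 MB to 179 MB works out to a reduction of roughly 73.4%.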

However, there seems to be some performance loss when sorting by a String field, which we are still tracking down.

Note that modern OSs will happily swap out RAM from a process, in order to increase the IO cache. This is rather silly: Lucene loads these specific structures into RAM because we know we will need to randomly access them, a great many times. Other structures, like the postings data, we know we will sweep sequentially once per search, so it's less important that these structures be in RAM. When the OS swaps our RAM out in favor of IO cache, it's reversing this careful separation!

This will of course cause disastrous search latency for Lucene, since many page faults may be incurred when running a given search. On Linux, you can fix this by tuning swappiness down to 0, which I try to do on every Linux computer I touch (most Linux distros default this to a highish number). Windows also has a setting, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache; it is likely doing something similar.
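
If you want a search process to check this setting at startup, a minimal sketch follows. It assumes a Linux system that exposes the value at /proc/sys/vm/swappiness; the class SwappinessCheck and its method names are hypothetical, and the program only reads and reports the value, it does not change it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Reads the Linux swappiness setting and reports whether it is above 0. */
final class SwappinessCheck {
    /** Standard procfs location of the setting on Linux. */
    static final Path SWAPPINESS = Paths.get("/proc/sys/vm/swappiness");

    /** Parses the file content, which is a single integer plus a newline. */
    static int parse(String raw) {
        return Integer.parseInt(raw.trim());
    }

    /** True if the value is above the 0 recommended in the post. */
    static boolean needsTuning(int swappiness) {
        return swappiness > 0;
    }

    public static void main(String[] args) throws IOException {
        if (!Files.isReadable(SWAPPINESS)) {
            System.out.println("No readable " + SWAPPINESS + "; probably not Linux.");
            return;
        }
        String raw = new String(Files.readAllBytes(SWAPPINESS), StandardCharsets.US_ASCII);
        int value = parse(raw);
        if (needsTuning(value)) {
            System.out.println("vm.swappiness is " + value + "; consider lowering it to 0.");
        } else {
            System.out.println("vm.swappiness is 0.");
        }
    }
}
```

Changing the value is done outside Java, as root: running sysctl vm.swappiness=0 applies it until the next reboot, and adding vm.swappiness = 0 to /etc/sysctl.conf makes it persistent.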

12 comments:

  1. FS Cache Vs Java heap - Would have been nice if the swappiness factor could have been provided per process. Though if you're running a dedicated Search server then this is less of a problem.

  2. Hi Gili,

    Maybe there is some way, but I don't know about it! That sure would be nice.

    Replies
    1. There's actually a way to set swappiness per process if you're on a new-ish Linux kernel (2.6.24 or above?). Check out the memory controller for cgroups: https://www.kernel.org/doc/Documentation/cgroups/memory.txt

      You can also set lots of other things as well, like RAM used for RSS + cache, etc., and there are other controllers for things like CPU priority.

    2. cgroups looks great! Thanks for sharing Stephen.

  3. In case I use MMapDirectory, should I worry about the O.S swappiness?

  4. Hi kbros,

    You should still worry about OS swappiness even when using MMapDir, if you care about search latency.

    Lucene loads certain structures into RAM (deleted docs, norms, terms index, field cache / doc values) and if the OS swaps those out it will cause latency spikes in your searching.

    I turn it off (set swappiness to 0) on every Linux box I touch... swapping is a poor abstraction.

  5. Hi Mike, when you say "The terms dict index requires substantial RAM per indexed term", does RAM mean heap memory, or are you referring to non-heap memory?

    Replies
    1. Hi Ashish,

      I mean heap memory, i.e. allocated Java objects. But the terms index RAM usage in 4.x is now a tiny fraction of what it used to be...

  6. Hi Michael, the method you describe for turning down the OS's IO caching on Windows doesn't seem to exist from Windows 2008 onwards. The only option there is System Properties -> Advanced -> Performance -> Advanced -> Adjust for Performance of "Programs / Background Services", which actually only controls processor scheduling.

    However, Microsoft does seem to provide a Dynamic Cache Service, available for download for Windows 2008 (http://support.microsoft.com/kb/976618); for Windows 2008 R2 it can only be obtained via an MSDN ticket.

  7. Thanks Swami; it's spooky that it's getting harder in Windows to have it NOT swap out your process's RAM.

  8. Hi Michael, looks like terms index divisor is no longer supported. Is there some other way of controlling what is loaded into memory?

    Replies
    1. Hi Murali,

      In fact, the default postings format (Lucene50PostingsFormat) takes two parameters (a min and a max int) saying how many terms should be written into each on-disk block. They default to 25 and 48, but if you increase them then you will see the same effect of increasing the terms index divisor from older Lucene releases.
