- The terms dict index requires substantial RAM per indexed term (by default, every 128th unique term), and is loaded when IndexReader is created. This can be a very large amount of RAM for indexes that have an unusually high number of unique terms; to reduce this, you can pass a terms index divisor when opening the reader. For example, passing 2, which loads only every other indexed term, halves the RAM required. The tradeoff is that seeking to a given term, which is required once for every TermQuery, becomes slower, since Lucene must do twice as much scanning (on average) to find the term.
- Field cache, which is used under the hood when you sort by a field, takes some amount of per-document RAM depending on the field type (String is by far the worst). It is loaded the first time you sort on that field.
- Norms, which encode the a-priori document boost computed at indexing time, including length normalization and any boosting the app does, consume 1 byte per document for each field used for searching. For example, if your app searches 3 different fields, such as body, title and abstract, then that requires 3 bytes of RAM per document. Norms are loaded on demand the first time that field is searched.
- Deletions, if present, consume 1 bit per document. This structure is created during IndexReader construction.
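To make the per-document costs above concrete, here is a rough back-of-the-envelope sketch. It does not call any Lucene API; the class and method names are made up for illustration, and the document and unique-term counts are assumed figures. Only the ratios come from the list above (1 byte per searched field per document for norms, 1 bit per document for deletions, every 128th term in the terms index, and a divisor that thins the index further).

```java
// Back-of-the-envelope estimates only. All names are hypothetical;
// nothing here calls Lucene.
final class RamEstimate {
    // Norms: 1 byte per document for each searched field.
    static long normsBytes(long maxDoc, int searchedFields) {
        return maxDoc * searchedFields;
    }

    // Deletions: 1 bit per document, rounded up to whole bytes.
    static long deletionsBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    // Terms index: every (indexInterval * divisor)-th unique term is held in RAM.
    static long loadedIndexTerms(long uniqueTerms, int indexInterval, int divisor) {
        long step = (long) indexInterval * divisor;
        return (uniqueTerms + step - 1) / step; // ceiling division
    }

    // After jumping to the nearest loaded index term, a seek scans about
    // half the gap between loaded terms on average.
    static long averageScan(int indexInterval, int divisor) {
        return (long) indexInterval * divisor / 2;
    }

    public static void main(String[] args) {
        long maxDoc = 5_000_000L;       // assumed document count
        long uniqueTerms = 10_000_000L; // assumed figure, for illustration only

        System.out.println("norms, 3 fields: " + normsBytes(maxDoc, 3) + " bytes");
        System.out.println("deletions: " + deletionsBytes(maxDoc) + " bytes");
        for (int divisor = 1; divisor <= 2; divisor++) {
            System.out.println("divisor " + divisor + ": "
                + loadedIndexTerms(uniqueTerms, 128, divisor) + " index terms in RAM, ~"
                + averageScan(128, divisor) + " terms scanned per seek");
        }
    }
}
```

The per-term RAM cost of the terms index is left out on purpose, since it depends on the Lucene version and the terms themselves; the sketch only shows that doubling the divisor roughly halves the number of loaded terms while doubling the average scan.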
Warming a reader is necessary because of the data structures that are initialized lazily (norms, FieldCache). It's also useful to pre-populate the OS's IO cache with those pages that cover the frequent terms you're searching on.
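The effect of warming on lazily initialized structures can be illustrated without Lucene at all. The sketch below uses a hypothetical `Lazy` holder as a stand-in for something like norms or a FieldCache entry: the first access pays the load cost, and warming simply means making that first access before live searches arrive. In a real application, warming would instead mean running representative queries and sorts against the new reader.

```java
import java.util.function.Supplier;

// Illustration only: hypothetical names, no Lucene API involved.
final class WarmingSketch {
    // Stand-in for a structure that is built on first access.
    static final class Lazy<T> {
        private final Supplier<T> loader;
        private T value;
        private int loadCount;

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        synchronized T get() {
            if (value == null) {
                value = loader.get(); // the expensive first-use load
                loadCount++;
            }
            return value;
        }

        synchronized int loadCount() {
            return loadCount;
        }
    }

    public static void main(String[] args) {
        // Pretend this is one byte of norms per document.
        Lazy<byte[]> norms = new Lazy<>(() -> new byte[1_000_000]);

        norms.get(); // warming: pay the load cost before live traffic arrives
        norms.get(); // a later "real" search reuses the loaded structure

        System.out.println("loads performed: " + norms.loadCount());
    }
}
```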
With flexible indexing, available in Lucene's trunk (4.0-dev), we've made great progress on reducing the RAM required for both the terms dict index and the String index field cache (some details here). We have substantially reduced the number of objects created for these RAM-resident data structures, and switched to representing all character data as UTF-8, not Java's char, which halves the RAM required when the character data is simple ASCII.
So, I ran a quick check against a real index, created from the first 5 million documents from the Wikipedia database export. The index has a single segment with no deletions. I initialize a searcher, then load norms for the body field and populate the FieldCache for sorting by the title field, using a 64-bit JRE 1.6:
- 3.1-dev requires 674 MB of RAM
- 4.0-dev requires 179 MB of RAM
That's a 73% reduction in the RAM required!
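For reference, the percentage follows directly from the two measurements above:

```latex
\[
\frac{674\,\text{MB} - 179\,\text{MB}}{674\,\text{MB}} = \frac{495}{674} \approx 0.734
\]
```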
However, there seems to be some performance loss when sorting by a String field, which we are still tracking down.
Note that modern OSs will happily swap out RAM from a process, in order to increase the IO cache. This is rather silly: Lucene loads these specific structures into RAM because we know we will need to randomly access them, a great many times. Other structures, like the postings data, we know we will sweep sequentially once per search, so it's less important that these structures be in RAM. When the OS swaps our RAM out in favor of IO cache, it's reversing this careful separation!
This will of course cause disastrous search latency for Lucene, since many page faults may be incurred on running a given search. On Linux, you can fix this by tuning swappiness down to 0, which I try to do on every Linux computer I touch (most Linux distros default this to a highish number). Windows also has a setting, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache; it is likely doing something similar.
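On Linux, the current value is exposed at `/proc/sys/vm/swappiness`; it can be changed at runtime with `sysctl vm.swappiness=0` (as root) and made persistent with a `vm.swappiness = 0` line in `/etc/sysctl.conf`. As a small sketch, an application could check the value at startup and warn if it is not 0. The class and method names below are hypothetical, and treating any nonzero value as worth a warning is my assumption based on the advice above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical startup check; not part of Lucene.
final class SwappinessCheck {
    // Parses the contents of /proc/sys/vm/swappiness, e.g. "60\n".
    static int parse(String contents) {
        return Integer.parseInt(contents.trim());
    }

    // Returns the current value, or -1 if the file is missing
    // (for example on a non-Linux OS) or cannot be parsed.
    static int read(Path path) {
        try {
            byte[] bytes = Files.readAllBytes(path);
            return parse(new String(bytes, StandardCharsets.UTF_8));
        } catch (IOException | NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        int value = read(Paths.get("/proc/sys/vm/swappiness"));
        if (value < 0) {
            System.out.println("swappiness not available on this system");
        } else if (value > 0) {
            System.out.println("vm.swappiness is " + value
                + "; the OS may swap out RAM-resident index structures");
        } else {
            System.out.println("vm.swappiness is 0");
        }
    }
}
```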