Flexible indexing in Lucene (now available on trunk, which will eventually be the next major release, 4.0) enables apps to use custom codecs to write/read the postings (fields, terms, docs, positions, payloads).
By default, Lucene uses the StandardCodec, which writes and reads in nearly the same format as the current stable branch (3.x). Details for a given term are stored in terms dictionary files, while the docs and positions where that term occurs are stored in separate files.
But there is an experimental codec, PulsingCodec, which implements the pulsing optimization described in a paper by Doug Cutting and Jan Pedersen. The idea is to inline the docs/positions/payloads data into the terms dictionary for low frequency terms, so that you save 1 disk seek when retrieving document(s) for that term.
The PulsingCodec wraps another fallback Codec that you provide; this allows the pulsing to be dynamic, per term. For each term, if its frequency (the number of documents that it appears in) is below a threshold (default 1) that you provide, then that term's postings are inlined into the terms dictionary; otherwise, the term is forwarded (pulsed) to the wrapped codec. This means PulsingCodec should be helpful for ordinary text fields which obey Zipf's Law, as many terms will be rare-ish.
PulsingCodec should really shine on "primary key" fields, where each term occurs in exactly one document, and batch lookups (for example because the app performs deletes, updates and/or lookups) are common.
I created a simple performance test to confirm this.
The test first creates an optimized index with 10M docs, where each doc has a single field with a randomly generated unique term, and then performs term -> doc lookup for N (parameter) random terms. It's a self-contained test (source code is here).
It's important to flush your OS's IO cache before running the test; otherwise you can't measure the reduced number of seeks. On recent Linux kernels, just run echo 1 > /proc/sys/vm/drop_caches. That said, in a real production usage, the IO cache will typically (legitimately) help you, and pulsing should make more efficient use of the IO cache since the postings data is contiguously stored.
To measure the speedup from using PulsingCodec on a primary key field, as well as the impact of the OS's IO cache, I ran the above test on an increasing number of random term lookups (always flushing the the OS's IO cache first):
The results are compelling! When performing a small number of term lookups relative to the total number of terms on a cold OS IO cache, which is likely the more common case in a real application, pulsing shows a ~45-50% speedup, as expected, since it requires 1/2 the seeks.
As the number of random term lookups increases, PulsingCodec's gains decrease, because more and more of the lookups are hitting the OS's IO cache and thus avoiding the seek (the machine I ran the test on had plenty of RAM to cache the entire index). It's interesting that PulsingCodec still shows ~15% gain once the lookups are mostly cached; likely this is because PulsingCodec saves the deref cost of finding the postings in the frq file.
Pulsing also makes the index a bit smaller (211 MB vs 231 MB), because it saves one vLong pointer per term. For the test, the index with pulsing had a 0 byte frq file since all postings were inlined into the terms dict. There is no prx file because I index the field with setOmitTermFreqAndPositions(true).
Note that the test case simply uses PulsingCodec for all fields; if you'd like per-field control you should use the PerFieldCodecWrapper. However, because PulsingCodec is dynamic (per term), it is likely a good default for all fields.
Another way to speed up primary key lookups through Lucene is to store your index on a solid-state disk, where seeks are much less costly than they are on spinning magnets (though, still several orders of magnitude more costly than RAM). Or better yet, do both!