Changing Bits: Lucene's PulsingCodec on "Primary Key" Fields

Saturday, June 5, 2010

Update Aug, 2014: the pulsing approach described here works well and has now been incorporated into Lucene's default postings format, so there's really no need to use PulsingPostingsFormat yourself unless you are using a custom postings format that doesn't do its own pulsing.

Flexible indexing in Lucene (now available on trunk, which will eventually be the next major release, 4.0) enables apps to use custom codecs to write/read the postings (fields, terms, docs, positions, payloads).

By default, Lucene uses the StandardCodec, which writes and reads in nearly the same format as the current stable branch (3.x). Details for a given term are stored in terms dictionary files, while the docs and positions where that term occurs are stored in separate files.

But there is an experimental codec, PulsingCodec, which implements the pulsing optimization described in a paper by Doug Cutting and Jan Pedersen. The idea is to inline the docs/positions/payloads data into the terms dictionary for low frequency terms, so that you save 1 disk seek when retrieving document(s) for that term.

The PulsingCodec wraps another fallback Codec that you provide; this allows the pulsing to be dynamic, per term. For each term, if its frequency (the number of documents that it appears in) is at or below a threshold that you provide (default 1), then that term's postings are inlined into the terms dictionary; otherwise, the term is forwarded (pulsed) to the wrapped codec. This means PulsingCodec should also be helpful for ordinary text fields which obey Zipf's Law, as many terms will be rare-ish.
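The per-term decision can be sketched with a toy model. This is not Lucene's API or file format: the class `ToyPulsingIndex`, its field names and the one-seek-per-file accounting are all hypothetical, and the cutoff is treated as inclusive so that single-document terms are inlined at the default of 1 (which matches the test below, where every posting ended up in the terms dictionary).

```python
from dataclasses import dataclass, field


@dataclass
class ToyPulsingIndex:
    """Toy model of pulsing; not Lucene's actual file format or API."""
    freq_cutoff: int = 1                               # inline terms whose doc freq <= cutoff (assumed inclusive)
    terms_dict: dict = field(default_factory=dict)     # term -> ("inline", docs) or ("pointer", offset)
    postings_file: list = field(default_factory=list)  # stands in for the wrapped codec's postings file

    def add_term(self, term, doc_ids):
        if len(doc_ids) <= self.freq_cutoff:
            # Rare term: keep the postings inside the terms dictionary entry.
            self.terms_dict[term] = ("inline", list(doc_ids))
        else:
            # Common term: hand the postings to the fallback and keep a pointer.
            self.terms_dict[term] = ("pointer", len(self.postings_file))
            self.postings_file.append(list(doc_ids))

    def lookup(self, term):
        """Return (doc_ids, seeks), counting one seek per file touched."""
        kind, value = self.terms_dict[term]
        if kind == "inline":
            return value, 1                    # terms dictionary only
        return self.postings_file[value], 2    # terms dictionary + postings file


index = ToyPulsingIndex(freq_cutoff=1)
index.add_term("id:42", [7])           # primary-key style term: one document
index.add_term("the", [1, 2, 3, 7])    # frequent term
print(index.lookup("id:42"))           # ([7], 1)
print(index.lookup("the"))             # ([1, 2, 3, 7], 2)
```

In this model a primary key term always costs one seek, while a frequent term costs the same two seeks it would without pulsing.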

PulsingCodec should really shine on "primary key" fields, where each term occurs in exactly one document, and batch lookups (for example because the app performs deletes, updates and/or lookups) are common.

I created a simple performance test to confirm this.

The test first creates an optimized index with 10M docs, where each doc has a single field with a randomly generated unique term, and then performs term -> doc lookup for N (parameter) random terms. It's a self-contained test (source code is here).
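For readers who cannot reach the original source, the shape of such a test can be sketched as follows. This is only an illustration, not the author's code: it uses an in-memory Python dict instead of a Lucene index, so it cannot show any seek savings, and the function names, the 64-bit hex terms and the scaled-down document count are all assumptions.

```python
import random
import time


def build_toy_index(num_docs, seed=0):
    """Map each unique random term to the single doc that contains it."""
    rng = random.Random(seed)
    terms = set()
    while len(terms) < num_docs:
        terms.add("%016x" % rng.getrandbits(64))
    # The doc id is simply the term's position in sorted order.
    return {term: doc_id for doc_id, term in enumerate(sorted(terms))}


def run_lookups(index, num_lookups, seed=1):
    """Look up num_lookups random existing terms; return (hits, seconds)."""
    rng = random.Random(seed)
    all_terms = list(index)
    start = time.perf_counter()
    hits = 0
    for _ in range(num_lookups):
        term = rng.choice(all_terms)
        if index.get(term) is not None:
            hits += 1
    return hits, time.perf_counter() - start


index = build_toy_index(num_docs=100_000)   # the post used 10M docs
hits, seconds = run_lookups(index, num_lookups=1_000)
print(f"{hits} lookups resolved in {seconds * 1000:.2f} ms")
```

The parameter to vary is `num_lookups`, which corresponds to N in the description above.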

It's important to flush your OS's IO cache before running the test; otherwise you can't measure the reduced number of seeks. On recent Linux kernels, just run echo 1 > /proc/sys/vm/drop_caches as root (ideally after a sync, since dirty pages are not dropped). That said, in real production usage, the IO cache will typically (legitimately) help you, and pulsing should make more efficient use of the IO cache since the postings data is contiguously stored.

To measure the speedup from using PulsingCodec on a primary key field, as well as the impact of the OS's IO cache, I ran the above test on an increasing number of random term lookups (always flushing the OS's IO cache first):

The results are compelling! When performing a small number of term lookups relative to the total number of terms on a cold OS IO cache, which is likely the more common case in a real application, pulsing shows a ~45-50% speedup, as expected, since it requires half the seeks (one into the terms dictionary instead of one into the terms dictionary plus one into the frq file).
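The expected figure follows from a simple cost model in which seek latency dominates a cold lookup. The sketch below assumes a 10 ms seek (an illustrative value, not one measured in this test) and reads "speedup" as the reduction in elapsed time.

```python
def lookup_time_ms(num_lookups, seeks_per_lookup, seek_ms):
    """Elapsed time for cold lookups when seek latency dominates everything else."""
    return num_lookups * seeks_per_lookup * seek_ms


SEEK_MS = 10.0   # assumed latency of one seek on a spinning disk, not measured
standard = lookup_time_ms(1000, seeks_per_lookup=2, seek_ms=SEEK_MS)  # terms dict + frq file
pulsing = lookup_time_ms(1000, seeks_per_lookup=1, seek_ms=SEEK_MS)   # terms dict only

reduction = 1 - pulsing / standard
print(f"standard: {standard:.0f} ms, pulsing: {pulsing:.0f} ms, time saved: {reduction:.0%}")
# standard: 20000 ms, pulsing: 10000 ms, time saved: 50%
```

The assumed seek latency cancels out of the ratio, so 50% is the ceiling under this model; the measured ~45-50% sits just below it, presumably because a real lookup also spends some time on work other than seeking.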

As the number of random term lookups increases, PulsingCodec's gains decrease, because more and more of the lookups are hitting the OS's IO cache and thus avoiding the seek (the machine I ran the test on had plenty of RAM to cache the entire index). It's interesting that PulsingCodec still shows ~15% gain once the lookups are mostly cached; likely this is because PulsingCodec saves the deref cost of finding the postings in the frq file.
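A small simulation illustrates why the cold-cache advantage fades as the lookup count grows. It is a rough sketch under stated assumptions: the page count is hypothetical, each lookup is taken to touch one uniformly random page, and the cache is unbounded (as on the test machine, which could hold the whole index in RAM).

```python
import random


def cold_miss_fraction(num_pages, num_lookups, seed=0):
    """Fraction of lookups that touch a page not yet in the (unbounded) cache."""
    rng = random.Random(seed)
    cached = set()
    misses = 0
    for _ in range(num_lookups):
        page = rng.randrange(num_pages)
        if page not in cached:
            misses += 1          # this lookup would need a real disk seek
            cached.add(page)
    return misses / num_lookups


NUM_PAGES = 50_000   # hypothetical number of pages holding the index files
for n in (100, 10_000, 1_000_000):
    print(n, round(cold_miss_fraction(NUM_PAGES, n), 3))
```

With few lookups nearly every one misses the cache, so halving the seeks halves the time; with many lookups the miss fraction approaches zero and only the non-seek savings remain.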

Pulsing also makes the index a bit smaller (211 MB vs 231 MB), because it saves one vLong pointer per term. For the test, the index with pulsing had a 0 byte frq file since all postings were inlined into the terms dict. There is no prx file because I index the field with setOmitTermFreqAndPositions(true).
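As a sanity check on the size difference, the sketch below encodes integers in the variable-length style Lucene uses for vInt/vLong (7 data bits per byte, low-order bits first, high bit set when another byte follows) and divides the reported savings by the number of terms. Treating 1 MB as 10^6 bytes is an assumption, and the function is an illustration rather than Lucene's own code.

```python
def encode_vlong(value):
    """Encode a non-negative int 7 bits at a time, low-order bits first.

    The high bit of each byte signals that another byte follows.
    """
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


for v in (5, 127, 128, 16_383, 16_384):
    print(v, len(encode_vlong(v)), "byte(s)")   # 1, 1, 2, 2, 3 bytes respectively

# Reported sizes: 231 MB without pulsing, 211 MB with pulsing, over 10M terms.
saved_bytes_per_term = (231 - 211) * 1_000_000 / 10_000_000   # assumes 1 MB = 10**6 bytes
print(saved_bytes_per_term)   # 2.0
```

Roughly 2 bytes saved per term is in the range a short vLong would occupy, which is at least consistent with the explanation above; how much of that is the pointer itself versus other per-term overhead cannot be told from these two numbers alone.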

Note that the test case simply uses PulsingCodec for all fields; if you'd like per-field control you should use the PerFieldCodecWrapper. However, because PulsingCodec is dynamic (per term), it is likely a good default for all fields.

Another way to speed up primary key lookups through Lucene is to store your index on a solid-state disk, where seeks are much less costly than they are on spinning magnets (though still several orders of magnitude more costly than RAM). Or better yet, do both!

10 comments:

SearchMan, June 20, 2010 at 2:37 AM
Excellent work Michael!

Mark Harwood, May 25, 2012 at 11:23 AM
FYI I have updated the test rig here: https://issues.apache.org/jira/browse/LUCENE-4069

I notice that Pulsing Codec is now slower than the standard Lucene40.
Also, the 4069 Jira includes a BloomFilter codec that looks to be 35% faster than Lucene40.

Michael McCandless, May 25, 2012 at 12:05 PM
Hi Mark,

That's news to me that PulsingPostingsFormat is now slower than standard Lucene40; it's hard to explain because it should only save seeks. Did you flush the IO cache between each run?

The BloomFilter results sound great!

Anonymous, October 30, 2013 at 3:47 AM
source code links not working :-( says "This site is under maintain! We will be back soon."

Michael would you be so kind to share the source some other way?

Thanks!

Anonymous, January 21, 2014 at 3:08 AM
Hey Mike,
I am new to this technology and I am confused with the terms mentioned below.
Are Terms Dictionary and Inverted Index same?

unsgu, December 15, 2014 at 2:22 AM
Thank you for information..