Saturday, June 5, 2010

Lucene's PulsingCodec on "Primary Key" Fields

Update Aug, 2014: the pulsing approach described here works well and has now been incorporated into Lucene's default postings format, so there's really no need to use PulsingPostingsFormat yourself unless you are using a custom postings format that doesn't do its own pulsing.

Flexible indexing in Lucene (now available on trunk, which will eventually be the next major release, 4.0) enables apps to use custom codecs to write/read the postings (fields, terms, docs, positions, payloads).

By default, Lucene uses the StandardCodec, which writes and reads in nearly the same format as the current stable branch (3.x). Details for a given term are stored in terms dictionary files, while the docs and positions where that term occurs are stored in separate files.

But there is an experimental codec, PulsingCodec, which implements the pulsing optimization described in a paper by Doug Cutting and Jan Pedersen. The idea is to inline the docs/positions/payloads data into the terms dictionary for low frequency terms, so that you save 1 disk seek when retrieving document(s) for that term.

The PulsingCodec wraps another fallback Codec that you provide; this allows the pulsing to be dynamic, per term. For each term, if its frequency (the number of documents that it appears in) is below a threshold (default 1) that you provide, then that term's postings are inlined into the terms dictionary; otherwise, the term is forwarded (pulsed) to the wrapped codec. This means PulsingCodec should be helpful for ordinary text fields which obey Zipf's Law, as many terms will be rare-ish.

PulsingCodec should really shine on "primary key" fields, where each term occurs in exactly one document, and batch lookups (for example because the app performs deletes, updates and/or lookups) are common.

I created a simple performance test to confirm this.

The test first creates an optimized index with 10M docs, where each doc has a single field with a randomly generated unique term, and then performs term -> doc lookup for N (parameter) random terms. It's a self-contained test (source code is here).

It's important to flush your OS's IO cache before running the test; otherwise you can't measure the reduced number of seeks. On recent Linux kernels, just run echo 1 > /proc/sys/vm/drop_caches. That said, in a real production usage, the IO cache will typically (legitimately) help you, and pulsing should make more efficient use of the IO cache since the postings data is contiguously stored.

To measure the speedup from using PulsingCodec on a primary key field, as well as the impact of the OS's IO cache, I ran the above test on an increasing number of random term lookups (always flushing the the OS's IO cache first):



The results are compelling! When performing a small number of term lookups relative to the total number of terms on a cold OS IO cache, which is likely the more common case in a real application, pulsing shows a ~45-50% speedup, as expected, since it requires 1/2 the seeks.

As the number of random term lookups increases, PulsingCodec's gains decrease, because more and more of the lookups are hitting the OS's IO cache and thus avoiding the seek (the machine I ran the test on had plenty of RAM to cache the entire index). It's interesting that PulsingCodec still shows ~15% gain once the lookups are mostly cached; likely this is because PulsingCodec saves the deref cost of finding the postings in the frq file.

Pulsing also makes the index a bit smaller (211 MB vs 231 MB), because it saves one vLong pointer per term. For the test, the index with pulsing had a 0 byte frq file since all postings were inlined into the terms dict. There is no prx file because I index the field with setOmitTermFreqAndPositions(true).

Note that the test case simply uses PulsingCodec for all fields; if you'd like per-field control you should use the PerFieldCodecWrapper. However, because PulsingCodec is dynamic (per term), it is likely a good default for all fields.

Another way to speed up primary key lookups through Lucene is to store your index on a solid-state disk, where seeks are much less costly than they are on spinning magnets (though, still several orders of magnitude more costly than RAM). Or better yet, do both!

9 comments:

  1. FYI I have updated the test rig here: https://issues.apache.org/jira/browse/LUCENE-4069

    I notice that Pulsing Codec is now slower than the standard Lucene40.
    Also, the 4069 Jira includes a BloomFilter codec that looks to be 35% faster than Lucene40.

    ReplyDelete
  2. Hi Mark,

    That's news to me that PulsingPostingsFormat is now slower than standard Lucene40; it's hard to explain because it should only save seeks. Did you flush the IO cache between each run?

    The BloomFilter results sound great!

    ReplyDelete
    Replies
    1. I'm on Windows here so flushing is hard without a reboot.
      I've tried it having rebuilt the index from cold but the main figures I focused on were from "warmed" runs where the results were fastest and stabilised i.e. I was not seeing a change from one run to the next.

      Delete
  3. source code links not working :-( says "This site is under maintain! We will be back soon."

    Michael would you be so kind to share the source some other way?

    Thanks!

    ReplyDelete
    Replies
    1. Hmm, not sure why I used that site :)

      Here's a better link: https://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/perf/PKLookupPerfTest.java

      But the sources may have changed since I made this blog post (long ago).

      Delete
  4. Hey Mike,
    I am new to this technology and I am confused with the terms mentioned below.
    Are Terms Dictionary and Inverted Index same?

    ReplyDelete
    Replies
    1. Hi Anonymous,

      The Terms Dictionary is a subset of the Inverted Index; it's responsible for recording all unique terms ever encountered during indexing, and then mapping each of those terms to metadata (docFreq, totalTermFreq, etc.) as well as to the list of documents and positions and offsets for that term (the postings).

      The Inverted Index refers to both the Terms Dictionary, and then all the postings.

      Beyond that there's also stored fields, term vectors, deleted documents bitsets, doc values.

      Delete
    2. Thanks Mike for the help really appreciated :)

      Delete