Saturday, June 22, 2013

2X faster PhraseQuery with Lucene using C++ via JNI

I recently described the new lucene-c-boost GitHub project, which provides amazing speedups (up to 7.8X faster) for common Lucene query types using specialized C++ implementations via JNI.

The code works with a stock Lucene 4.3.0 JAR and default codec, and has a trivial API: just call NativeSearch.search instead of IndexSearcher.search.

Now, a quick update: I've optimized PhraseQuery as well:

Task        QPS base  StdDev base  QPS opt  StdDev opt  Speedup
HighPhrase       3.5       (2.7%)      6.5      (0.4%)    1.9 X
MedPhrase       27.1       (1.4%)     51.9      (0.3%)    1.9 X
LowPhrase        7.6       (1.7%)     16.4      (0.3%)    2.2 X


A ~2X speedup (~90% to ~119%) is nice!
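
For readers who want to check the arithmetic, here is a minimal sketch of how the speedup factor and the percent change relate to the two QPS columns. The class and method names are hypothetical (they are not part of lucene-c-boost or of the benchmark tooling), and because the table shows rounded QPS values, the percentages it prints can differ by a few points from the ones quoted above.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of lucene-c-boost.
final class PhraseSpeedup {

    // Ratio of optimized to baseline throughput; 2.0 means "2X faster".
    static double speedup(double qpsBase, double qpsOpt) {
        return qpsOpt / qpsBase;
    }

    // The same comparison expressed as a percent gain over the baseline.
    static double percentChange(double qpsBase, double qpsOpt) {
        return (qpsOpt - qpsBase) / qpsBase * 100.0;
    }

    public static void main(String[] args) {
        // Rounded QPS figures copied from the table in this post.
        String[] tasks = {"HighPhrase", "MedPhrase", "LowPhrase"};
        double[] base = {3.5, 27.1, 7.6};
        double[] opt = {6.5, 51.9, 16.4};

        for (int i = 0; i < tasks.length; i++) {
            System.out.printf(Locale.ROOT, "%-10s %.2f X  (%+.1f%%)%n",
                tasks[i],
                speedup(base[i], opt[i]),
                percentChange(base[i], opt[i]));
        }
    }
}
```

A speedup of N X is the same thing as a percent change of (N - 1) * 100, which is why "about 2X" and "about +100%" describe the same result.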

Again, it's great to see reduced variance in the runtimes, since HotSpot is mostly not an issue. It's odd that LowPhrase gets lower QPS than MedPhrase: these queries look mislabelled (I see the LowPhrase queries getting more hits than MedPhrase!).

All changes have been pushed to lucene-c-boost; next I'd like to figure out how to get facets working.

4 comments:

  1. Hey Mike, interesting

    Out of interest, do you have any theories about why the Java code is so much slower?
    Have you learnt anything about C++ versus Java optimization here?

    Cheers,
    Simon

    Replies
    1. I suspect most of the gains are from specializing/hardwiring the code to a specific query, collector, etc., but I haven't done the obvious test (create the same specialized code in Java instead of C)...

    2. This is not surprising. Even though Java's performance is close to that of C++, it seems that there is still about a 1.5-2x difference.
      See, e.g.:
      http://benchmarksgame.alioth.debian.org/u32/java.php

    3. Thanks for sharing that link, Itman.
