Changing Bits: Screaming fast Lucene searches using C++ via JNI

Wednesday, June 19, 2013

Screaming fast Lucene searches using C++ via JNI

At the end of the day, when Lucene executes a query, after the initial setup the true hot-spot is usually rather basic code that decodes sequential blocks of integer docIDs, term frequencies and positions, matches them (e.g. taking union or intersection for BooleanQuery), computes a score for each hit and finally saves the hit if it's competitive, during collection.

Even apparently complex queries like FuzzyQuery or WildcardQuery go through a rewrite process that reduces them to much simpler forms like BooleanQuery.

Lucene's hot-spots are so simple that optimizing them by porting them to native C++ (via JNI) was too tempting!

So I did just that, creating the lucene-c-boost github project, and the resulting speedups are exciting:

Task	QPS base	StdDev base	QPS opt	StdDev opt	% change
AndHighLow	469.2	(0.9%)	316.0	(0.7%)	0.7 X
Fuzzy1	63.0	(3.3%)	62.9	(2.0%)	1.0 X
Fuzzy2	25.8	(3.1%)	37.9	(2.3%)	1.5 X
AndHighMed	50.4	(0.7%)	110.0	(0.9%)	2.2 X
OrHighNotLow	46.8	(5.6%)	106.3	(1.3%)	2.3 X
LowTerm	298.6	(1.8%)	691.4	(3.4%)	2.3 X
OrHighNotMed	34.0	(5.3%)	89.2	(1.3%)	2.6 X
OrHighNotHigh	5.0	(5.7%)	14.2	(0.8%)	2.8 X
Wildcard	17.2	(1.2%)	51.1	(9.5%)	3.0 X
AndHighHigh	21.9	(1.0%)	69.0	(1.0%)	3.5 X
OrHighMed	18.7	(5.7%)	59.6	(1.1%)	3.2 X
OrHighHigh	6.7	(5.7%)	21.5	(0.9%)	3.2 X
OrHighLow	15.7	(5.9%)	50.8	(1.2%)	3.2 X
MedTerm	69.8	(4.2%)	243.0	(2.2%)	3.5 X
OrNotHighHigh	13.3	(5.7%)	46.7	(1.4%)	3.5 X
OrNotHighMed	26.7	(5.8%)	115.8	(2.8%)	4.3 X
HighTerm	22.4	(4.2%)	109.2	(1.4%)	4.9 X
Prefix3	10.1	(1.1%)	55.5	(3.7%)	5.5 X
OrNotHighLow	62.9	(5.5%)	351.7	(9.3%)	5.6 X
IntNRQ	5.0	(1.4%)	38.7	(2.1%)	7.8 X

These results are on the full, multi-segment Wikipedia English index with 33.3 M documents. Besides the amazing speedups, it's also nice to see that the variance (StdDev column) is generally lower with the optimized C++ version, because hotspot has (mostly) been taken out of the equation.

The API is easy to use, and works with the default codec so you won't have to re-index just to try it out: instead of IndexSearcher.search, call NativeSearch.search. If the query can be optimized, it will be; otherwise it will seamlessly fallback to IndexSearcher.search. It's fully decoupled from Lucene and works with the stock Lucene 4.3.0 JAR, using Java's reflection APIs to grab the necessary bits.

This is all very new code, and I'm sure there are plenty of exciting bugs, but (after some fun debugging!) all Lucene core tests now pass when using NativeSearch.search.

This is not a C++ port of Lucene

This code is definitely not a general C++ port of Lucene. Rather, it implements a very narrow set of classes, specifically the common query types. The implementations are not general-purpose: they hardwire (specialize) specific code, removing all abstractions like Scorer, DocsEnum, Collector, DocValuesProducer, etc.

There are some major restrictions on when the optimizations will apply:

Only tested on Linux and Intel CPU so far
Requires Lucene 4.3.x
Must use NativeMMapDirectory as your Directory implementation, which maps entire files into RAM (avoids the chunking that the Java-based MMapDirectory must do)
Must use the default codec
Only sort by score is supported
None of the optimized implementations use advance: first, this code is rather complex and will be quite a bit of work to port to C++, and second, queries that benefit from advance are generally quite fast already so we may as well leave them in Java

BooleanQuery is optimized, but only when all clauses are TermQuery against the same field.

C++ is not faster than java!

Not necessarily, anyway: before anyone goes off screaming how these results "prove" Java is so much slower than C++, remember that this is far from a "pure" C++ vs Java test. There are at least these three separate changes mixed in:

Algorithmic changes. For example, lucene-c-boost sometimes uses BooleanScorer where Lucene is using BooleanScorer2. Really we need to fix Lucene to do similar algorithmic changes (when they are faster). In particular, all of the OrXX queries that include a Not clause, as well as IntNRQ in the above results, benefit from algorithmic changes.
Code specialization: lucene-c-boost implements the searches as big hardwired scary-looking functions, removing all the nice Lucene abstractions. While abstractions are obviously necessary in Lucene, they unfortunately add run-time cost overhead, so removing these abstractions gives some gains.
C++ vs java.

It's not at all clear how much of the gains are due to which part; really I need to create the "matching" specialized Java sources to do a more pure test.

This code is dangerous!

Specifically, whenever native C++ code is embedded in Java, there is always the risk of all those fun problems with C++ that we Java developers thought we left behind. For example, if there are bugs (likely!), or even innocent API mis-use by the application such as accidentally closing an IndexReader while other threads are still using it, the process will hit a Segmentation Fault and the OS will destroy the JVM. There may also be memory leaks! And, yes, the C++ sources even use the goto statement.

Work in progress...

This is a work in progress and there are still many ideas to explore. For example, Lucene 4.3.x's default PostingsFormat stores big-endian longs, which means the little-endian Intel CPU must do byte-swapping when decoding each postings block, so one thing to try is a PostingsFormat better optimized for the CPU at search time. Positional queries, Filters and nested BooleanQuery are not yet optimized, as well as certain configurations (e.g., fields that omit norms). Patches welcome!

Nevertheless, initial results are very promising, and if you are willing to risk the dangers in exchange for massive speedups please give it a whirl and report back.

20 comments:

Han JiangJune 20, 2013 at 5:46 AM
Interesting! And much gain since LUCENE-5049!

Mike, which version of python are you using for the build.py script? Actually on my machine I hit syntax error noise either with py2.7 or py3.3.

Will the codes work on x64 machine? I was quite curious about the perf test, and followed the installation guide to get the .so file compiled and .patch merged. But during ant test, there're some reflection errors, will you have a look at this ?

ReplyDelete
Replies
Michael McCandlessJune 20, 2013 at 7:03 AM
Hi Han,

Actually I'm using Python 2.7.3 to run build.py; what syntax error are you hitting?

The code works on x64 (that's what I'm running on, an i7-3770K).

Those reflection errors are because you're running with 4.x not 4.3.x ... there were a few changes that affect the reflection.
ReplyDelete
Replies
Delip RaoJune 21, 2013 at 6:53 PM
This is very cool. Any idea how I could plug this in Solr?
ReplyDelete
Replies
AndyJune 24, 2013 at 8:20 PM
This looks great!

1) What does AndHighLow do? It's the only task that sees a major regression. Any idea why?

2) Any chance you can port this to Solr? More people could benefit from this work that way since there're probably more people running Solr than raw Lucene.
ReplyDelete
Replies
AndyJune 25, 2013 at 10:15 PM
FunctionQuery is used to boost document scores by the log of a "popularity" value.

Is that something that lucene-c-boost could potentially support? Thanks.
ReplyDelete
Replies
AndyJune 26, 2013 at 5:48 PM
Popularity is just a field of type integer. Is there anything I can do to put it in FieldCache? According to the wiki (http://wiki.apache.org/solr/SolrCaching#The_Lucene_FieldCache) FieldCache is not managed by Solr and there's no config option so not sure what could be done here. Right now it is also not in DocValues. Is it just a case of adding docValues="true" to schema.xml? If so that could be easily done.

Popularity is combined with relevance score using the !boost function. SOmething similar to http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_change_the_score_of_a_document_based_on_the_.2Avalue.2A_of_a_field_.28say.2C_.22popularity.22.29

It'd be really great if lucene-c-boost works for something like this. It's probably a very common use case.
ReplyDelete
Replies
Michael McCandlessJune 27, 2013 at 6:56 AM
HI Andy,

I'm not sure how to configure Solr to use doc values; this was only added in recent Solr releases. Maybe ask on solr-user@lucene.apache.org?

I'll think about how to allow boosting like this in lucene-c-boost.
ReplyDelete
Replies
AndyJune 27, 2013 at 11:07 PM
Michael,

I'll ask about DocValues on the Solr mailing list.

Thanks for considering this use case. Looking forward to new versions of lucene-c-boost.
ReplyDelete
Replies

Add comment