BooleanQuery), computes a score for each hit and finally saves the hit if it's competitive, during collection. Even apparently complex queries like FuzzyQuery or WildcardQuery go through a rewrite process that reduces them to much simpler forms like BooleanQuery.
Lucene's hot-spots are so simple that optimizing them by porting them to native C++ (via JNI) was too tempting!
So I did just that, creating the lucene-c-boost GitHub project, and the resulting speedups are exciting:
These results are on the full, multi-segment Wikipedia English index with 33.3 M documents. Besides the amazing speedups, it's also nice to see that the variance (StdDev column) is generally lower with the optimized C++ version, because HotSpot has (mostly) been taken out of the equation.
The API is easy to use, and works with the default codec so you won't have to re-index just to try it out: instead of IndexSearcher.search, call NativeSearch.search. If the query can be optimized, it will be; otherwise it will seamlessly fall back to IndexSearcher.search. It's fully decoupled from Lucene and works with the stock Lucene 4.3.0 JAR, using Java's reflection APIs to grab the necessary bits.
This is all very new code, and I'm sure there are plenty of exciting bugs, but (after some fun debugging!) all Lucene core tests now pass when using NativeSearch.search.
This is not a C++ port of Lucene
This code is definitely not a general C++ port of Lucene. Rather, it implements a very narrow set of classes, specifically the common query types. The implementations are not general-purpose: they hardwire (specialize) specific code, removing all abstractions like Scorer, DocsEnum, Collector, DocValuesProducer, etc.
There are some major restrictions on when the optimizations will apply:
- Only tested on Linux and Intel CPU so far
- Requires Lucene 4.3.x
- Must use NativeMMapDirectory as your Directory implementation, which maps entire files into RAM (avoids the chunking that the Java-based MMapDirectory must do)
- Must use the default codec
- Only sort by score is supported
- None of the optimized implementations use advance: first, this code is rather complex and will be quite a bit of work to port to C++, and second, queries that benefit from advance are generally quite fast already so we may as well leave them in Java
- BooleanQuery is optimized, but only when all clauses are TermQuery against the same field
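The advance tradeoff noted in the restrictions can be sketched with toy postings lists. This is purely illustrative Java, not Lucene's actual DocsEnum API: a linear scan visits every document of both clauses (as the hardwired C++ code does), while an advance-style intersection drives by the rare clause and skips into the common one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of conjunction matching, illustrating why skipping ("advance")
// beats a full linear scan when one clause is very low frequency.
public class ConjunctionDemo {

    // Linear-scan intersection: steps through every docID of both lists.
    static List<Integer> scanIntersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { hits.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return hits;
    }

    // advance()-style intersection: for each doc of the rare clause, skip
    // (here via binary search) into the common clause, touching far fewer entries.
    static List<Integer> advanceIntersect(int[] rare, int[] common) {
        List<Integer> hits = new ArrayList<>();
        int lo = 0;
        for (int doc : rare) {
            int idx = Arrays.binarySearch(common, lo, common.length, doc);
            if (idx >= 0) { hits.add(doc); lo = idx + 1; }
            else lo = -idx - 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] high = new int[100000];
        for (int d = 0; d < high.length; d++) high[d] = d;  // very high frequency term
        int[] low = {3, 50000, 99999};                      // very low frequency term
        System.out.println(scanIntersect(high, low));       // [3, 50000, 99999]
        System.out.println(advanceIntersect(low, high));    // [3, 50000, 99999]
    }
}
```

Both paths produce the same hits; the difference is that the scan touches all 100,000 entries of the high-frequency list while the skipping version touches only a handful, which is why a low-frequency MUST clause strongly favors staying on the Java advance path.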
C++ is not faster than Java!
Not necessarily, anyway: before anyone goes off screaming how these results "prove" Java is so much slower than C++, remember that this is far from a "pure" C++ vs Java test. There are at least these three separate changes mixed in:
- Algorithmic changes. For example, lucene-c-boost sometimes uses BooleanScorer where Lucene is using BooleanScorer2. Really we need to fix Lucene to make similar algorithmic changes (when they are faster). In particular, all of the OrXX queries that include a Not clause, as well as IntNRQ in the above results, benefit from algorithmic changes.
- Code specialization: lucene-c-boost implements the searches as big hardwired scary-looking functions, removing all the nice Lucene abstractions. While abstractions are obviously necessary in Lucene, they unfortunately add run-time overhead, so removing these abstractions gives some gains.
- C++ vs Java.
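The specialization point can be made concrete with a toy contrast. The Scorer interface below is hypothetical, standing in for Lucene's abstractions; the idea is only that a virtual call per document has a cost that a hardwired loop with the scoring formula inlined avoids.

```java
// Toy contrast between an abstracted and a specialized hot loop.
public class SpecializationDemo {
    interface Scorer { float score(int doc); }

    // Abstracted path: one virtual call per document.
    static float maxScoreAbstract(int numDocs, Scorer scorer) {
        float max = Float.NEGATIVE_INFINITY;
        for (int doc = 0; doc < numDocs; doc++) {
            max = Math.max(max, scorer.score(doc));
        }
        return max;
    }

    // Specialized path: the scoring formula is hardwired into the loop,
    // so the JIT (or a C++ compiler) sees one flat, easily-optimized body.
    static float maxScoreSpecialized(int numDocs, float[] freqs, float boost) {
        float max = Float.NEGATIVE_INFINITY;
        for (int doc = 0; doc < numDocs; doc++) {
            max = Math.max(max, freqs[doc] * boost);
        }
        return max;
    }

    public static void main(String[] args) {
        float[] freqs = {1f, 4f, 2f};
        float boost = 2f;
        Scorer s = doc -> freqs[doc] * boost;
        System.out.println(maxScoreAbstract(freqs.length, s));               // 8.0
        System.out.println(maxScoreSpecialized(freqs.length, freqs, boost)); // 8.0
    }
}
```

Both loops compute the same answer; the specialized one simply gives the compiler a much simpler shape to optimize, which is part (but only part) of the speedups above.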
This code is dangerous!
Specifically, whenever native C++ code is embedded in Java, there is always the risk of all those fun problems with C++ that we Java developers thought we left behind. For example, if there are bugs (likely!), or even innocent API mis-use by the application such as accidentally closing an IndexReader while other threads are still using it, the process will hit a Segmentation Fault and the OS will destroy the JVM. There may also be memory leaks! And, yes, the C++ sources even use the goto statement.
Work in progress...
This is a work in progress and there are still many ideas to explore. For example, Lucene 4.3.x's default PostingsFormat stores big-endian longs, which means the little-endian Intel CPU must do byte-swapping when decoding each postings block, so one thing to try is a PostingsFormat better optimized for the CPU at search time. Positional queries, filters and nested BooleanQuery are not yet optimized, nor are certain configurations (e.g., fields that omit norms). Patches welcome!
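The byte-swapping cost can be sketched in plain Java. This is not the actual postings decode loop, just example data showing the extra work a big-endian on-disk format forces on a little-endian CPU:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: decoding a long written big-endian on a little-endian CPU
// requires a byte swap; a CPU-native (little-endian) layout would not.
public class EndianDemo {
    static long decodeBigEndian(byte[] block) {
        // What the big-endian format forces: the buffer performs the swap per value.
        return ByteBuffer.wrap(block).order(ByteOrder.BIG_ENDIAN).getLong();
    }

    static long decodeLittleEndian(byte[] block) {
        // A CPU-native read: no swap, but on big-endian data it yields reversed bytes.
        return ByteBuffer.wrap(block).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        long value = 0x0123456789ABCDEFL;  // arbitrary example data
        byte[] be = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN).putLong(value).array();

        System.out.println(Long.toHexString(decodeBigEndian(be)));   // 123456789abcdef
        // The native read gets the bytes reversed; an explicit swap recovers the value:
        System.out.println(Long.toHexString(Long.reverseBytes(decodeLittleEndian(be))));
    }
}
```

A postings format written in the CPU's native order would eliminate that per-value Long.reverseBytes-style work from the decode hot loop entirely.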
Nevertheless, initial results are very promising, and if you are willing to risk the dangers in exchange for massive speedups please give it a whirl and report back.
Interesting! And much gain since LUCENE-5049!
Mike, which version of Python are you using for the build.py script? Actually on my machine I hit syntax error noise with either py2.7 or py3.3.
Will the code work on an x64 machine? I was quite curious about the perf test, and followed the installation guide to get the .so file compiled and the .patch merged. But during ant test, there are some reflection errors; will you have a look at this?
Hi Han,
Actually I'm using Python 2.7.3 to run build.py; what syntax error are you hitting?
The code works on x64 (that's what I'm running on, an i7-3770K).
Those reflection errors are because you're running with 4.x not 4.3.x ... there were a few changes that affect the reflection.
Oh... I see, I used the wrong branch. The running test seems ok now!
As for the syntax error, oh it is just a minor one: I suppose it should return True instead of true in 'newer()'? So sad Python doesn't fully check syntax by default.
By the way, what arguments do you pass to luceneutil? Is 'base' set as default codec+MMapDirectory, and 'comp' set as default codec+NativeMMapDirectory?
Ahh, true/True, you're right: OK I pushed a fix. Thanks!
Yes, base is a stock 43x checkout with MMapDir, comp is a 43x checkout w/ that checked-in patch applied (which you'll have to tweak since it has hard-coded paths in there ...), using NativeMMapDirectory.
OK! The tests pass on my machine now. But luceneutil is based on trunk, right? I cannot get the perf test to compile.
Oh sorry, that's right. You need to 1) copy the full luceneutil tree into your 43x checkout's root dir (i.e., side by side with "lucene"), and then 2) apply the patch here: http://pastebin.com/dcDKzxyC
Do that for both the base and comp checkouts.
Whenever luceneutil sees a luceneutil subdir in the base or comp checkout, it will compile/run those sources for its tests.
Hmm, so we're not using competition now? Here is my example.py: http://pastebin.com/9FzqCqUC
Whoops, the wrong one:
Deletehttp://pastebin.com/zuUF6za1
And my current directory structure:
http://pastebin.com/DPyAvFbB
Oh, I got all three luceneutil copies patched; it seems I'm still missing some configs?
This is very cool. Any idea how I could plug this in Solr?
Hi Delip,
That's a good question, I haven't explored that.
It'd likely require a patch to SolrIndexSearcher.java, to invoke NativeSearch.search when appropriate, and to add the libLuceneCBoost.jar onto the CLASSPATH and libNativeSearch.so on the LD_LIBRARY_PATH.
But since Solr so often needs to get the full bitset back for all matching docs (since it builds its own filters from this), NativeSearch won't apply in those cases. I'll think about how to also return the full bitset (Lucene's facet module needs this as well) ...
This looks great!
1) What does AndHighLow do? It's the only task that sees a major regression. Any idea why?
2) Any chance you can port this to Solr? More people could benefit from this work that way since there are probably more people running Solr than raw Lucene.
Hi Andy,
AndHighLow is +foo +bar where foo is very high frequency and bar is very low frequency. It gets slower in the C code because that code does not use advance ... rather, it scans all docs and computes the intersection.
This is easy to fix: I need to put logic up front that uses the normal (Java) path whenever there's a low-frequency MUST clause ... I just haven't done so yet.
On a Solr port, possibly ... but Solr does some strange things, like collecting a top-level bitset for all matching docs as a side-effect of the search; I'd need to put a path for this in C as well, which should not be too bad.
Thanks Michael. Hopefully you could do a Solr port. I'd love to try this out on my Solr setup.
My most common use case is faceting with FunctionQuery score boosting. Performance is not very good and latency is erratic. Would lucene-c-boost be able to help?
Hmm, lucene-c-boost as it stands today cannot handle FunctionQuery score boosting: it currently supports only the default (TFIDF) scoring. But ... what is your FunctionQuery computing per hit?
Solr's faceting is in theory supportable, if we fix lucene-c-boost to return the bitset of all hits...
FunctionQuery is used to boost document scores by the log of a "popularity" value.
Is that something that lucene-c-boost could potentially support? Thanks.
Is the popularity value in FieldCache/DocValues? If so then it should be easy to get lucene-c-boost to support that.
Do you just use the popularity value as the entire score? Or do you blend it with the relevance score?
Popularity is just a field of type integer. Is there anything I can do to put it in FieldCache? According to the wiki (http://wiki.apache.org/solr/SolrCaching#The_Lucene_FieldCache) FieldCache is not managed by Solr and there's no config option so not sure what could be done here. Right now it is also not in DocValues. Is it just a case of adding docValues="true" to schema.xml? If so that could be easily done.
Popularity is combined with relevance score using the !boost function. Something similar to http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_change_the_score_of_a_document_based_on_the_.2Avalue.2A_of_a_field_.28say.2C_.22popularity.22.29
It'd be really great if lucene-c-boost works for something like this. It's probably a very common use case.
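For what it's worth, the blend being discussed here can be sketched in a few lines. This is a hypothetical stand-in for what !boost with log(popularity) computes per document, per the SolrRelevancyFAQ example above; the numbers are made up:

```java
// Illustrative sketch (not Solr code): !boost with b=log(popularity)
// multiplies the query's relevance score by the function value.
public class PopularityBoostDemo {
    static float boostedScore(float relevance, int popularity) {
        return relevance * (float) Math.log(popularity);
    }

    public static void main(String[] args) {
        // A doc with popularity 1 contributes nothing (log 1 = 0);
        // higher popularity scales the relevance score up.
        System.out.println(boostedScore(2.0f, 100));
        System.out.println(boostedScore(2.0f, 10));
    }
}
```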
Hi Andy,
I'm not sure how to configure Solr to use doc values; this was only added in recent Solr releases. Maybe ask on solr-user@lucene.apache.org?
I'll think about how to allow boosting like this in lucene-c-boost.
Michael,
I'll ask about DocValues on the Solr mailing list.
Thanks for considering this use case. Looking forward to new versions of lucene-c-boost.