We had two projects this summer; here I'll describe Han Jiang's project (I was his mentor). The second project, separating StorableField and IndexableField, is also very exciting and is wrapping up now.
The goal of Han's project was to create a compelling replacement int-block postings format, improving on the aging variable-int postings format that Lucene has used forever. The project was definitely a success! This morning I merged over the feature branch we had been using all summer, committing to both Lucene's trunk (future 5.x) and Lucene's 4.x branch (to be 4.0 GA). This means Lucene 4.0 GA will have the new BlockPostingsFormat available.
The new postings format shows incredible search performance gains versus the current default postings format. I tested with luceneutil on the full May 2012 English Wikipedia index (33.3 million documents, 15 GB index); the % change column gives the low-to-high range implied by the standard deviations, so a negative range means the new format was slower for that task:
Task | QPS base | StdDev base | QPS packed | StdDev packed | % change |
---|---|---|---|---|---|
AndHighLow | 353.09 | 9.16 | 317.06 | 2.53 | -13% to -7% |
Fuzzy2 | 53.08 | 1.02 | 49.58 | 3.41 | -14% to 1% |
Respell | 41.23 | 0.56 | 39.67 | 1.37 | -8% to 0% |
PKLookup | 169.41 | 1.94 | 166.69 | 1.57 | -3% to 0% |
Fuzzy1 | 33.07 | 0.82 | 35.24 | 0.93 | 1% to 12% |
IntNRQ | 3.82 | 0.01 | 4.09 | 0.28 | 0% to 14% |
MedSpanNear | 1.46 | 0.03 | 1.60 | 0.03 | 5% to 13% |
HighSpanNear | 0.55 | 0.01 | 0.62 | 0.01 | 9% to 17% |
LowSpanNear | 2.65 | 0.06 | 3.08 | 0.03 | 12% to 19% |
HighSloppyPhrase | 0.63 | 0.02 | 0.74 | 0.02 | 9% to 23% |
OrHighHigh | 3.04 | 0.17 | 3.55 | 0.35 | 0% to 35% |
HighPhrase | 0.65 | 0.00 | 0.76 | 0.04 | 9% to 24% |
OrHighMed | 6.62 | 0.39 | 7.75 | 0.80 | 0% to 37% |
LowSloppyPhrase | 2.34 | 0.04 | 2.78 | 0.05 | 14% to 23% |
Wildcard | 17.20 | 0.20 | 20.42 | 0.76 | 12% to 24% |
OrHighLow | 7.29 | 0.44 | 8.67 | 0.92 | 0% to 39% |
AndHighMed | 23.11 | 0.61 | 27.55 | 0.34 | 14% to 23% |
LowPhrase | 7.44 | 0.10 | 9.01 | 0.44 | 13% to 28% |
Prefix3 | 24.65 | 0.35 | 29.94 | 1.23 | 14% to 28% |
MedPhrase | 3.47 | 0.04 | 4.24 | 0.18 | 15% to 28% |
MedSloppyPhrase | 2.03 | 0.04 | 2.51 | 0.04 | 19% to 28% |
LowTerm | 145.81 | 0.36 | 182.14 | 2.02 | 23% to 26% |
MedTerm | 51.64 | 0.25 | 65.57 | 1.14 | 24% to 29% |
HighTerm | 8.49 | 0.03 | 10.94 | 0.23 | 25% to 32% |
AndHighHigh | 6.52 | 0.19 | 8.70 | 0.19 | 26% to 40% |
The gains are awesome! After this change has baked for a while I expect we'll make it the default postings format, maybe in time for 4.1.
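If you want to try it before it becomes the default, you can select it per field via a custom codec. Here's a rough sketch of how that might look against the 4.0 APIs, assuming BlockPostingsFormat lives in org.apache.lucene.codecs.block and using a made-up "body" field name; adapt it to your own schema:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.block.BlockPostingsFormat; // assumed package in 4.0
import org.apache.lucene.codecs.lucene40.Lucene40Codec;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class TryBlockPostings {

  public static IndexWriterConfig makeConfig() {
    // Use the new block format for the (made-up) "body" field and keep the
    // 4.0 default postings format for all other fields.
    final PostingsFormat blockFormat = new BlockPostingsFormat();
    Codec codec = new Lucene40Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        if ("body".equals(field)) {
          return blockFormat;
        }
        return super.getPostingsFormatForField(field);
      }
    };

    IndexWriterConfig iwc = new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    iwc.setCodec(codec);
    return iwc;
  }
}
```

Searching needs no special setup: the format's name is recorded per segment, so readers resolve the right postings format automatically when they open the index.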
This effort has been a long time coming: ever since the creation of flexible indexing, allowing for full pluggability of all files in the index, we have wanted to improve on Lucene's postings format. Early int-block prototypes showed great performance.
The new format stores postings data in fixed blocks of 128 ints. Each block is encoded using a simple packed-ints format. The .doc file interleaves docID-delta and term-frequency blocks, and also includes skip data, while the .pos file holds position deltas. Offsets and payloads, if present, are stored in the .pay file. On the full English Wikipedia index used in the above test this resulted in a nice reduction in disk usage: 13.7 GB (new) vs 14.7 GB (current), about 7% smaller.
The format only uses packed int blocks for high-frequency terms: low-frequency terms (those appearing in fewer than 128 documents), and the final partial block of a high-frequency term, are stored in the same variable-int format as today's default. Skip points are stored at the start of each block (except the first block).
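To make the core trick concrete, here is a tiny, self-contained sketch (my own illustration, not Lucene's actual code): delta-encode a full block of 128 ascending docIDs, find the largest number of bits any delta needs, and write all 128 deltas at that fixed bit width, with a leftover partial block falling back to variable-int encoding as described above.

```java
public class PackedBlockSketch {
  static final int BLOCK_SIZE = 128;

  // Packs one full block of ascending docIDs as: [bitsPerValue][packed deltas].
  static long[] packBlock(int[] docIDs, int prevDocID) {
    int[] deltas = new int[BLOCK_SIZE];
    int maxDelta = 0;
    for (int i = 0; i < BLOCK_SIZE; i++) {
      deltas[i] = docIDs[i] - (i == 0 ? prevDocID : docIDs[i - 1]);
      maxDelta = Math.max(maxDelta, deltas[i]);
    }
    // Every delta in the block is written with the same bit width.
    int bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(maxDelta));
    long[] packed = new long[1 + (BLOCK_SIZE * bitsPerValue + 63) / 64];
    packed[0] = bitsPerValue;            // tiny block header: the bit width
    long bitPos = 0;
    for (int delta : deltas) {
      int word = 1 + (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      packed[word] |= ((long) delta) << shift;
      if (shift + bitsPerValue > 64) {   // the value straddles two longs
        packed[word + 1] |= ((long) delta) >>> (64 - shift);
      }
      bitPos += bitsPerValue;
    }
    return packed;
  }

  public static void main(String[] args) {
    int[] docIDs = new int[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++) {
      docIDs[i] = (i + 1) * 3;           // a toy ascending docID list
    }
    long[] packed = packBlock(docIDs, 0);
    System.out.println("bits per value = " + packed[0]
        + ", packed longs = " + (packed.length - 1));
  }
}
```

Keeping every value in a block at the same bit width is what makes decoding cheap: the reader can unpack all 128 values in a simple unrolled loop with no per-value branching.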
Han built and tested a number of fun ideas, including skipping within blocks (with partial block decoding), using PForDelta instead of simple packed ints, wrapping with PulsingPostingsFormat, all sorts of different ways to decode packed ints, different values for blockSize and skipMultiplier, encoding all-single-values very compactly, etc. We spent lots of electricity running all sorts of benchmarks. Some ideas worked and some didn't, and things did not go according to the original plan! Such is the healthy, unpredictable, zig-zag, iterative nature of open-source development/exploration, yet in the end we find ourselves with a great new postings format.
Thank you Han Jiang for such an important improvement to Lucene, and thank you Google for sponsoring this effort through Google Summer of Code.
Nice one Mike, and great to hear a pure success not only with the code but also with the mentoring of the project as a whole. I suspect that your student Han Jiang will be sticking around the Lucene community for a while ;0) I'll look forward to your follow-up post if/when the code makes it into 4.1!
Thanks Mike! It is always enjoyable to discuss with you! I'll be glad to contribute if we need any further optimization!
We just published a paper where we use exactly the same scheme (as far as I can tell). We call it SIMD-BP128.
ReplyDeleteYou might be interested in the following blog post:
Fast integer compression: decoding billions of integers per second
http://lemire.me/blog/archives/2012/09/12/fast-integer-compression-decoding-billions-of-integers-per-second/
You will find a link to our paper, as well as our C++ code, in the blog post.
Though we wrote our software in C++, I have ported some of the schemes to Java:
https://github.com/lemire/JavaFastPFOR
Hi Daniel,
VERY interesting! Thank you for writing the paper and sharing your source code under the generous Apache Software License 2. I'll read the paper ... it looks very thorough.
I would love to do our packed bit decoding in C w/ SIMD instructions...
Hey Mike and Han, kudos on the great work.
Just a clarification: does this block postings codec come along with the 4.0 GA?
Hi Ramprakash,
Yes, BlockPostingsFormat is available in 4.0; however, it's not the default format, and I think the format has changed from 4.0 to 4.1, so if you index with it you'll have to re-index after upgrading.
As of 4.1 it's the default and will have full back compat ...
Hi Mike:
I have a Lucene performance question: my embedded hardware has poor random I/O performance, so I want to enlarge the postings format block size to load more data per I/O. Do you think that is reasonable?
However, according to the Lucene API docs (http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29), large values cause less memory to be used by IndexReader, but slow random access to terms.
Thanks
Carl Chen