While this is wonderfully simple, it requires an if statement on every byte during decode, which is very costly since the CPU cannot easily predict the branch outcome.
To reduce the number of branches per int decode, you must switch to decoding groups of ints at once. This paper describes a few such encoders (PForDelta, Rice Coding, Simple9, Simple16). Chapter 6 (Index Compression) of Information Retrieval: Implementing and Evaluating Search Engines goes into great detail on these and others, including an addenda showing results for Google's Group VarInt encoder.
The IntBlock codec is designed to make it easy to experiment with these sorts of different int encoders. It does the "hard part" of enabling Lucene to operate on a set of ints at once, which is somewhat tricky as the seek points (in the terms dict and the skipping lists) must now encode both the file-pointer where the block starts as well as the index (starting int) into the block.
There is already a low-level initial implementation Frame-of-reference (FOR) and Patched frame-of-reference (PFOR), on LUCENE-1410, as well as Simple9/16 on LUCENE-2189, thanks to Paul Elschot and Renaud Delbru.
FOR is a simple encoding: it takes each fixed block of ints, ands stores them all as packed ints, where each value gets N bits, set by the maximum int in the block. So say our block size is 8, and we have these ints:
1 7 3 5 6 2 2 5
we'd only need 3 bits per value. But, FOR has a clear weakness: if you have a single big int mixed in with a bunch of small ones, then you waste bits. For example if we have these ints:
1 7 3 5 293 2 2 5
then FOR must use 9 bits for all values (because of 293), despite the fact that the other values only needed 3 bits each. How much this hurts in practice remains to be seen.
PFOR fixes this by marking such large numbers as exceptions, which are then "patched" after the initial decode. So for the above example, we would still use 3 bits per-value, but we'd mark that the 293 value was an "exception" and it would be separately encoded at the end (with more bits), and then "patched" back in at decode time. The above paper has the details.
Note that block-based codecs are not always a good fit, since in order to access even one int within the block, they [typically] must decode the full block. This can be a high cost if you frequently need to decode just an int or two, such as with a primary key field. Pulsing codec is a better fit for such fields.
During testing, I found a few silly performance problems in the IntBlock codec, which I've hacked around for now (I'll post my patch on LUCENE-1410); the hacks are nowhere near committable, but are functioning correctly (identical search results for the queries I'm testing). I also made some more minor optimizations to the FOR/PFOR implementation. I'd like to get the IntBlock codec to the point where anyone can easily tie in a fixed or variable block size int encoding algorithm and run their own tests.
I indexed the first 10 million Wikipedia ~1KB sized docs. The resulting index sizes were:
Codec | Size | % Change |
Standard | 3.50 GB | |
FOR | 3.96 GB | 13.3% bigger |
PFOR | 3.87 GB | 11.1% bigger |
The size increase of both FOR and PFOR are fairly low. PFOR is smaller than FOR, as expected, but it's surprising that it's only a little bit lower. Though, this is a function of where you draw the line for the exception cases and will be corpus dependent. Still, I'm encouraged by how little additional space FOR requires over the Standard codec.
I ran a set of queries and measured the best time (of 40 runs) for each, across 4 threads. Here are the results for FOR:
Query | QPS Standard | QPS FOR | % Change |
united~0.6 | 5.59 | 5.08 | -9.0% |
"united states" | 11.46 | 11.22 | -2.1% |
united~0.7 | 18.33 | 19.23 | 4.9% |
united states | 15.53 | 18.66 | 20.1% |
uni*d | 34.25 | 43.37 | 26.6% |
unit* | 31.27 | 41.07 | 31.3% |
states | 55.72 | 75.82 | 36.1% |
+united +states | 15.17 | 21.43 | 41.2% |
and for PFOR:
Query | QPS Standard | QPS PFOR | % Change |
united~0.6 | 5.59 | 5.02 | -10.1% |
"united states" | 11.46 | 10.88 | -5.1% |
united~0.7 | 18.33 | 19.00 | 3.7% |
united states | 15.53 | 18.11 | 16.6% |
uni*d | 34.25 | 43.71 | 27.6% |
states | 55.72 | 71.39 | 28.1% |
unit* | 31.27 | 41.42 | 32.5% |
+united +states | 15.17 | 21.25 | 40.1% |
They both show good gains for most queries; FOR has a slight edge since it doesn't have to decode exceptions. The united~0.6 (max edit distance = 2) is slower, likely because a good number of the 519 terms it expands to have low frequency, and so the overhead of a full block decode is hurting performance. The PhraseQuery also got a little slower, probably because it uses non-block skipping during searching. The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints. Likely a hybrid codec for Lucene, using FOR or FFOR for high frequency terms, Standard for medium frequency, and Pulsing for very low frequency, would perform best overall.
I ran one additional test, this time using FOR on MMapDirectory (instead of NIOFSDirectory used above), passing an IntBuffer from the mapped pages directly to the FOR decoder:
Query | QPS Standard | QPS FOR | % Change |
united~0.6 | 6.00 | 5.28 | -12.1% |
united~0.7 | 18.48 | 20.53 | 11.1% |
"united states" | 10.47 | 11.70 | 11.8% |
united states | 15.21 | 20.25 | 33.2% |
uni*d | 33.16 | 47.59 | 43.5% |
unit* | 29.71 | 45.14 | 51.9% |
+united +states | 15.14 | 23.65 | 56.2% |
states | 52.30 | 88.66 | 69.5% |
The results are generally even better for two reasons. First, for some reason the Standard codec slows down a bit with MMapDirectory. Second, the FOR codec speeds up substantially, because I'm able (hacked up though) do a no-copy decode with MMapDirectory.
Remember that these results are very preliminary, based on a patch with many non-committable hacks (eg deleted docs are ignored!), that doesn't pass all Lucene's unit tests, etc. There are also still further optimizations to explore, and lots of work remains. So it's not clear where the final numbers will be, but these initial results are very encouraging!
If you have any other algorithms to try for encoding ints, pleaes pipe up! It's very easy to have Lucene generate a few billion ints for you to encode/decode if you want to run standalone tests.
Hi
ReplyDeleteI tried integrating PForDelta into lucene 2.9 but confronted a problem.
I use the implementation in http://code.google.com/p/integer-array-compress-kit/
it implements a basic PForDelta algorithm and an improved one(which called NewPForDelta, but there are many bugs and I have fixed them),
But compare it with VInt and S9, it's speed is very slow when only decode small number of integer arrays.
e.g. when I decoded int[256] arrays which values are randomly generated between 0 and 100, if decode just one array. PFor(or NewPFor) is very slow. when it continuously decodes many arrays such as 10000, it's faster than s9 and vint.
Another strange phenomena is that when call PFor decoder twice, the 2nd times it's faster. Or I call PFor first then NewPFor, the NewPFor is faster. reverse the call sequcence, the 2nd called decoder is faster
e.g.
ct.testNewPFDCodes(list);
ct.testPFor(list);
ct.testVInt(list);
ct.testS9(list);
NewPFD decode: 3614705
PForDelta decode: 17320
VINT decode: 16483
S9 decode: 19835
when I call by the following sequence
ct.testPFor(list);
ct.testNewPFDCodes(list);
ct.testVInt(list);
ct.testS9(list);
PForDelta decode: 3212140
NewPFD decode: 19556
VINT decode: 16762
S9 decode: 16483
how did you integrate PFor into lucene? my implementation is -- group docIDs and termDocFreqs into block which contains 128 integers. when SegmentTermDocs's next method called(or read readNoTf).it decodes a block and save it to a cache.
Hi fancyerii,
ReplyDeleteLooks like you also posted on Lucene/Solr's dev list -- I just responded there.
On how I integrated PFor into Lucene, the patch here was hacked up, but we are working on the "real" integration under LUCENE-2723, which should make intblock codecs easy and high-performance to plug into Lucene as a Codec.
Mike
Well, I found my way here searching for "PForDelta", and how interesting to find that someone is looking at it in relation to Lucene itself, which I also use. :)
ReplyDeleteThis post makes me curious about http://code.google.com/p/javaewah/ and http://ricerca.mat.uniroma3.it/users/colanton/concise.html .. in particular, concise seems to be nice for performing set operations.
Anonymous,
ReplyDeleteThat's great! We have a branch off Lucene's trunk right now to improve how intblock codecs integrate into Lucene's low level postings enumeration APIs...
Those two packages look very interesting; thanks for sharing.