Changing Bits: August 2010

Those of you with kids have undoubtedly discovered that sometimes they sleep with their eyes half open.

The first time it happens it's somewhat alarming, but, then, you get used to it and realize it's "normal". It probably just means their eyelids haven't quite closed yet.

Well, my one-year-old did this, yesterday, so I seized the opportunity to do an experiment!! I wanted to test just how asleep she was, so, I smiled like crazy, made funny faces, scrunched my nose, stuck my tongue out, at her half open eye.

And get this: she laughed! In her sleep. I repeated the test 3 times and each time she laughed. Yet, she was most definitely "asleep" -- the eyelid eventually closed and she didn't wake up for another hour.

I guess enough of the brain remains active to process what the eye is seeing even when the rest of the brain is sleeping? Wild!

Before you think I'm being unfair, she also conducts sleep experiments on us! Last weekend the whole family went camping. It was bed time and, as usual, we were all exhausted except our one-year-old. There was enough low light so she could see, so, as we were all dropping off one by one, she took turns running back and forth and pulling hard on our chins to wake us up again! Probably, from her perspective, this was like a horror movie, where her family is one by one dropping into comas and it's her job to prevent the disaster!

Kids are great fun!

Today, to encode the postings (docs, freqs, positions) in the index, Lucene uses a variable byte format where each integer is individually encoded as 1-5 bytes.

While this is wonderfully simple, it requires an if statement on every byte during decode, which is very costly since the CPU cannot easily predict the branch outcome.

To reduce the number of branches per int decode, you must switch to decoding groups of ints at once. This paper describes a few such encoders (PForDelta, Rice Coding, Simple9, Simple16). Chapter 6 (Index Compression) of Information Retrieval: Implementing and Evaluating Search Engines goes into great detail on these and others, including an addenda showing results for Google's Group VarInt encoder.

The IntBlock codec is designed to make it easy to experiment with these sorts of different int encoders. It does the "hard part" of enabling Lucene to operate on a set of ints at once, which is somewhat tricky as the seek points (in the terms dict and the skipping lists) must now encode both the file-pointer where the block starts as well as the index (starting int) into the block.

There is already a low-level initial implementation Frame-of-reference (FOR) and Patched frame-of-reference (PFOR), on LUCENE-1410, as well as Simple9/16 on LUCENE-2189, thanks to Paul Elschot and Renaud Delbru.

FOR is a simple encoding: it takes each fixed block of ints, ands stores them all as packed ints, where each value gets N bits, set by the maximum int in the block. So say our block size is 8, and we have these ints:


  1 7 3 5 6 2 2 5

we'd only need 3 bits per value. But, FOR has a clear weakness: if you have a single big int mixed in with a bunch of small ones, then you waste bits. For example if we have these ints:


  1 7 3 5 293 2 2 5

then FOR must use 9 bits for all values (because of 293), despite the fact that the other values only needed 3 bits each. How much this hurts in practice remains to be seen.

PFOR fixes this by marking such large numbers as exceptions, which are then "patched" after the initial decode. So for the above example, we would still use 3 bits per-value, but we'd mark that the 293 value was an "exception" and it would be separately encoded at the end (with more bits), and then "patched" back in at decode time. The above paper has the details.

Note that block-based codecs are not always a good fit, since in order to access even one int within the block, they [typically] must decode the full block. This can be a high cost if you frequently need to decode just an int or two, such as with a primary key field. Pulsing codec is a better fit for such fields.

During testing, I found a few silly performance problems in the IntBlock codec, which I've hacked around for now (I'll post my patch on LUCENE-1410); the hacks are nowhere near committable, but are functioning correctly (identical search results for the queries I'm testing). I also made some more minor optimizations to the FOR/PFOR implementation. I'd like to get the IntBlock codec to the point where anyone can easily tie in a fixed or variable block size int encoding algorithm and run their own tests.

I indexed the first 10 million Wikipedia ~1KB sized docs. The resulting index sizes were:

Codec	Size	% Change
Standard	3.50 GB
FOR	3.96 GB	13.3% bigger
PFOR	3.87 GB	11.1% bigger

The size increase of both FOR and PFOR are fairly low. PFOR is smaller than FOR, as expected, but it's surprising that it's only a little bit lower. Though, this is a function of where you draw the line for the exception cases and will be corpus dependent. Still, I'm encouraged by how little additional space FOR requires over the Standard codec.

I ran a set of queries and measured the best time (of 40 runs) for each, across 4 threads. Here are the results for FOR:

Query	QPS Standard	QPS FOR	% Change
united~0.6	5.59	5.08	-9.0%
"united states"	11.46	11.22	-2.1%
united~0.7	18.33	19.23	4.9%
united states	15.53	18.66	20.1%
uni*d	34.25	43.37	26.6%
unit*	31.27	41.07	31.3%
states	55.72	75.82	36.1%
+united +states	15.17	21.43	41.2%

and for PFOR:

Query	QPS Standard	QPS PFOR	% Change
united~0.6	5.59	5.02	-10.1%
"united states"	11.46	10.88	-5.1%
united~0.7	18.33	19.00	3.7%
united states	15.53	18.11	16.6%
uni*d	34.25	43.71	27.6%
states	55.72	71.39	28.1%
unit*	31.27	41.42	32.5%
+united +states	15.17	21.25	40.1%

They both show good gains for most queries; FOR has a slight edge since it doesn't have to decode exceptions. The united~0.6 (max edit distance = 2) is slower, likely because a good number of the 519 terms it expands to have low frequency, and so the overhead of a full block decode is hurting performance. The PhraseQuery also got a little slower, probably because it uses non-block skipping during searching. The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints. Likely a hybrid codec for Lucene, using FOR or FFOR for high frequency terms, Standard for medium frequency, and Pulsing for very low frequency, would perform best overall.

I ran one additional test, this time using FOR on MMapDirectory (instead of NIOFSDirectory used above), passing an IntBuffer from the mapped pages directly to the FOR decoder:

Query	QPS Standard	QPS FOR	% Change
united~0.6	6.00	5.28	-12.1%
united~0.7	18.48	20.53	11.1%
"united states"	10.47	11.70	11.8%
united states	15.21	20.25	33.2%
uni*d	33.16	47.59	43.5%
unit*	29.71	45.14	51.9%
+united +states	15.14	23.65	56.2%
states	52.30	88.66	69.5%

The results are generally even better for two reasons. First, for some reason the Standard codec slows down a bit with MMapDirectory. Second, the FOR codec speeds up substantially, because I'm able (hacked up though) do a no-copy decode with MMapDirectory.

Remember that these results are very preliminary, based on a patch with many non-committable hacks (eg deleted docs are ignored!), that doesn't pass all Lucene's unit tests, etc. There are also still further optimizations to explore, and lots of work remains. So it's not clear where the final numbers will be, but these initial results are very encouraging!

If you have any other algorithms to try for encoding ints, pleaes pipe up! It's very easy to have Lucene generate a few billion ints for you to encode/decode if you want to run standalone tests.

Changing Bits

Tuesday, August 24, 2010

Here comes another bubble

Saturday, August 7, 2010

Do babies really sleep?

Wednesday, August 4, 2010

Java developers are not consenting adults

Monday, August 2, 2010

Lucene performance with the PForDelta codec