Monday, August 2, 2010

Lucene performance with the PForDelta codec

Today, to encode the postings (docs, freqs, positions) in the index, Lucene uses a variable byte format where each integer is individually encoded as 1-5 bytes.

While this is wonderfully simple, it requires an if statement on every byte during decode, which is very costly since the CPU cannot easily predict the branch outcome.
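To make that branch concrete, here is a minimal sketch of a vInt decode loop in the same shape as Lucene's format (high bit set means "another byte follows"); the class and method signatures are mine, not Lucene's API:

```java
// Sketch of a variable-byte (vInt) decode. Each iteration tests the high
// bit of the byte just read -- the hard-to-predict branch described above.
public class VIntDecode {
  // Decodes one vInt starting at pos[0] in buf, advancing pos[0] past it.
  static int readVInt(byte[] buf, int[] pos) {
    byte b = buf[pos[0]++];
    int value = b & 0x7F;
    int shift = 7;
    while ((b & 0x80) != 0) {      // branch on every single byte read
      b = buf[pos[0]++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    }
    return value;
  }
}
```

A small int costs one iteration; larger ints cost up to five, and the CPU cannot tell in advance which it will get.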

To reduce the number of branches per int decode, you must switch to decoding groups of ints at once. This paper describes a few such encoders (PForDelta, Rice Coding, Simple9, Simple16). Chapter 6 (Index Compression) of Information Retrieval: Implementing and Evaluating Search Engines goes into great detail on these and others, including an addendum showing results for Google's Group VarInt encoder.

The IntBlock codec is designed to make it easy to experiment with these sorts of different int encoders. It does the "hard part" of enabling Lucene to operate on a set of ints at once, which is somewhat tricky as the seek points (in the terms dict and the skipping lists) must now encode both the file-pointer where the block starts as well as the index (starting int) into the block.

There are already low-level initial implementations of frame-of-reference (FOR) and patched frame-of-reference (PFOR) on LUCENE-1410, as well as Simple9/16 on LUCENE-2189, thanks to Paul Elschot and Renaud Delbru.

FOR is a simple encoding: it takes each fixed block of ints and stores them all as packed ints, where each value gets N bits, with N determined by the maximum int in the block. So say our block size is 8, and we have these ints:

1 7 3 5 6 2 2 5

we'd only need 3 bits per value. But, FOR has a clear weakness: if you have a single big int mixed in with a bunch of small ones, then you waste bits. For example if we have these ints:

1 7 3 5 293 2 2 5

then FOR must use 9 bits for all values (because of 293), despite the fact that the other values only needed 3 bits each. How much this hurts in practice remains to be seen.

PFOR fixes this by marking such large numbers as exceptions, which are then "patched" after the initial decode. So for the above example, we would still use 3 bits per-value, but we'd mark that the 293 value was an "exception" and it would be separately encoded at the end (with more bits), and then "patched" back in at decode time. The above paper has the details.
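Here is a minimal, illustrative sketch of the FOR bit math in Java. This is not Lucene's actual implementation; it just shows the two steps: choose the bit width from the largest value in the block, then pack every value at that width:

```java
// Illustrative frame-of-reference (FOR) packing for one block of
// non-negative ints: every value gets the same bit width, chosen from
// the largest value in the block.
public class ForBlock {
  // Number of bits needed to represent every value in the block.
  static int bitsRequired(int[] block) {
    int max = 0;
    for (int v : block) max |= v;   // OR is enough to find the highest set bit
    return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
  }

  // Packs the block into long words at bitsPerValue bits per value.
  static long[] pack(int[] block, int bitsPerValue) {
    long[] out = new long[(block.length * bitsPerValue + 63) / 64];
    int bitPos = 0;
    for (int v : block) {
      out[bitPos >> 6] |= ((long) v) << (bitPos & 63);
      int spill = (bitPos & 63) + bitsPerValue - 64;
      if (spill > 0) {              // value straddles a word boundary
        out[(bitPos >> 6) + 1] |= ((long) v) >>> (bitsPerValue - spill);
      }
      bitPos += bitsPerValue;
    }
    return out;
  }
}
```

For the first example block above, bitsRequired returns 3 and the whole block packs into 24 bits; add the single 293 and it jumps to 9 bits for every value, which is exactly the waste PFOR's exceptions avoid.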

Note that block-based codecs are not always a good fit, since in order to access even one int within the block, they typically must decode the full block. This can be a high cost if you frequently need to decode just an int or two, such as with a primary key field. The Pulsing codec is a better fit for such fields.

During testing, I found a few silly performance problems in the IntBlock codec, which I've hacked around for now (I'll post my patch on LUCENE-1410); the hacks are nowhere near committable, but are functioning correctly (identical search results for the queries I'm testing). I also made some more minor optimizations to the FOR/PFOR implementation. I'd like to get the IntBlock codec to the point where anyone can easily tie in a fixed or variable block size int encoding algorithm and run their own tests.

I indexed the first 10 million Wikipedia docs (each ~1 KB). The resulting index sizes were:

Codec      Size      % Change
Standard   3.50 GB
FOR        3.96 GB   13.3% bigger
PFOR       3.87 GB   11.1% bigger

The size increase of both FOR and PFOR is fairly low. PFOR is smaller than FOR, as expected, but it's surprising that it's only slightly smaller. Though, this is a function of where you draw the line for the exception cases and will be corpus dependent. Still, I'm encouraged by how little additional space FOR requires over the Standard codec.

I ran a set of queries and measured the best time (of 40 runs) for each, across 4 threads. Here are the results for FOR:

Query             QPS Standard   QPS FOR   % Change
united~0.6             5.59        5.08      -9.0%
"united states"       11.46       11.22      -2.1%
united~0.7            18.33       19.23       4.9%
united states         15.53       18.66      20.1%
uni*d                 34.25       43.37      26.6%
unit*                 31.27       41.07      31.3%
states                55.72       75.82      36.1%
+united +states       15.17       21.43      41.2%

and for PFOR:

Query             QPS Standard   QPS PFOR   % Change
united~0.6             5.59        5.02      -10.1%
"united states"       11.46       10.88       -5.1%
united~0.7            18.33       19.00        3.7%
united states         15.53       18.11       16.6%
uni*d                 34.25       43.71       27.6%
states                55.72       71.39       28.1%
unit*                 31.27       41.42       32.5%
+united +states       15.17       21.25       40.1%

They both show good gains for most queries; FOR has a slight edge since it doesn't have to decode exceptions. The united~0.6 (max edit distance = 2) is slower, likely because a good number of the 519 terms it expands to have low frequency, and so the overhead of a full block decode is hurting performance. The PhraseQuery also got a little slower, probably because it uses non-block skipping during searching. The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints. Likely a hybrid codec for Lucene, using FOR or FFOR for high frequency terms, Standard for medium frequency, and Pulsing for very low frequency, would perform best overall.

I ran one additional test, this time using FOR on MMapDirectory (instead of NIOFSDirectory used above), passing an IntBuffer from the mapped pages directly to the FOR decoder:

Query             QPS Standard   QPS FOR   % Change
united~0.6             6.00        5.28     -12.1%
united~0.7            18.48       20.53      11.1%
"united states"       10.47       11.70      11.8%
united states         15.21       20.25      33.2%
uni*d                 33.16       47.59      43.5%
unit*                 29.71       45.14      51.9%
+united +states       15.14       23.65      56.2%
states                52.30       88.66      69.5%

The results are generally even better, for two reasons. First, for some reason the Standard codec slows down a bit with MMapDirectory. Second, the FOR codec speeds up substantially, because I'm able (hacked up though it is) to do a no-copy decode with MMapDirectory.

Remember that these results are very preliminary, based on a patch with many non-committable hacks (eg deleted docs are ignored!), that doesn't pass all Lucene's unit tests, etc. There are also still further optimizations to explore, and lots of work remains. So it's not clear where the final numbers will be, but these initial results are very encouraging!

If you have any other algorithms to try for encoding ints, please pipe up! It's very easy to have Lucene generate a few billion ints for you to encode/decode if you want to run standalone tests.

Wednesday, July 14, 2010

Moving readVInt to C

By far the hottest spot in Lucene during searching is the method (DataInput.readVInt) that decodes Lucene's variable-length integer representation (vInt). This method is called an insane number of times, while iterating the postings lists (docs, freqs, positions), during query execution.

This representation is not CPU friendly: for every single byte read, the CPU hits a hard-to-predict if statement (testing the high bit). Alternative block-based formats do a good job reducing branches while decoding, such as these encodings (PFOR-DELTA is currently an experimental codec attached as a patch on LUCENE-1410) or Google's Group Varint.

Anyway, I decided to test whether porting readVInt to C would give us a performance boost. I knew the overhead of JNI had been high in the past, but I was hoping modern JREs had improved this.

Unfortunately, they haven't: I see a ~10%-~40% slowdown, depending on the query. Hotspot's inability to inline native methods is likely also a factor here. Perhaps, if/when we switch to a block-based codec, we'll see a gain with a native implementation, since the amortized overhead of invoking JNI will be lower.

Tuesday, July 13, 2010

Apple censorship, again

It's sickening that Apple thinks it's OK to censor (remove) posts from their forums. It's not the first time they've done this, either.

I consider it entrapment -- Apple hosts these forums as the obvious place where we all can go to discuss any issues we have. As with any forum, there's a natural implicit assumption of non-interference, much like I expect my ISP not to block my visits to random websites, my email provider not to block certain emails, or the local coffee shop not to forbid discussions about certain topics.

But then, suddenly, Apple acts like China: they censor that which they disagree with.

C'mon Apple!

You should be above such draconian behavior. If you disagree with what's being said, play fair! Come out and discuss the topic, explain your viewpoint, convince us.

Saturday, July 10, 2010

What motivates us

Great video, from Daniel Pink, animated by RSA, summarizing his book digging into what motivates people to do things. I find it very accurate about myself, and I expect especially people working in open-source projects will resonate strongly with the surprising findings.

Encoding videos for Apple iPad/iPod, using mencoder

After much searching, poking, and iterating, I found a magical mencoder command-line that encodes videos in the right format for the iPad. Debugging this is hard because, if you get something wrong, iTunes simply silently fails to open the video file, giving no specific feedback as to what's wrong. I felt like a monkey on a typewriter...

So here are the options, broken by group. Set the output format to MP4:

-of lavf -lavfopts format=mp4

Scale the video to width 1280, with height to match the aspect ratio (if your source video is low resolution, eg DVD, leave this off; you can also crop (-vf crop=W:H:X:Y; run mplayer -vf cropdetect,scale to have it guess the crop for you), or de-interlace (I like -vf pp=lb), or apply any other video filter):

-vf scale=1280:-3

Use x264 codec for video, with a zillion options (tweak that crf=XX if you want higher quality; lower values of XX give higher quality, but take more space):

-ovc x264 -x264encopts crf=28:vbv_maxrate=1500:nocabac:global_header:frameref=3:threads=auto:bframes=0:subq=6:mixed-refs=0:weightb=0:8x8dct=1:me=umh:partitions=all:qp_step=4:qcomp=0.7:trellis=1:direct_pred=auto

Use faac codec for audio:

-oac faac -faacopts br=160:mpeg=4:object=2:raw -channels 2 -srate 48000

I'm sure many other command-line combinations would work; this was just the first combo I found that works.
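Putting the option groups above together, the full command line looks like this (input.avi and output.mp4 are placeholder filenames):

```shell
mencoder input.avi \
  -of lavf -lavfopts format=mp4 \
  -vf scale=1280:-3 \
  -ovc x264 -x264encopts crf=28:vbv_maxrate=1500:nocabac:global_header:frameref=3:threads=auto:bframes=0:subq=6:mixed-refs=0:weightb=0:8x8dct=1:me=umh:partitions=all:qp_step=4:qcomp=0.7:trellis=1:direct_pred=auto \
  -oac faac -faacopts br=160:mpeg=4:object=2:raw -channels 2 -srate 48000 \
  -o output.mp4
```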

If you need to encode for the iPod Touch, just change the -vf scale=1280:-3 to -vf scale=640:-3 (at least this works for recent iPod Touch models).

You can also use the nifty mp4art tool to insert cover art after the encode finishes; this lets you set the image for the movie when browsing on the iPad/iPod or in iTunes, which is vital if young kids will be launching the movies! I use PNG images, which work fine.
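For example, with mp4art (from the mp4v2 tools; filenames here are placeholders):

```shell
# Attach cover art to the finished encode:
mp4art --add cover.png movie.mp4

# List the art now embedded in the file:
mp4art --list movie.mp4
```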

Friday, July 9, 2010

Old Orchard Beach

We spent the past half week in Old Orchard Beach, Maine (US), in the Crest Motel. It's a great place! The rooms are clean, the beach is right outside, and it's a short walk to Palace Playland. They have an indoor pool & hot tub, and I'm happy to report that they do not discriminate against kids. The whole area is very family friendly, and the kids had a blast!

But what I loved most was... while we were staying there, they happened to be installing solar hot water panels! At first we saw random holes drilled in the floor and ceiling, which made all of us very curious. Then the next day these were filled in with insulated pipes. Finally, this morning, we saw the solar panels themselves, being carried up and installed on the roof. The location is excellent -- they have a large roof, and no tall trees anywhere nearby. It looked like a high capacity installation, maybe ~250 square feet. They expect to fully generate all hot water in the summer, and also a sizable portion of the (air) heat in the winter.

I have a small (600W nominal, off-grid) solar electric installation in my house, self-installed. In the summer, it easily offsets the electricity used by our home theater. I involved the kids all throughout the installation, to teach them how important it is to find renewable sources for everything. And whenever they or their friends watch movies, I remind them that it's all powered by sunshine electricity, which of course leads to a great curious discussion and more teachable moments.

So I took this chance to (again) hammer home this vital lesson. And, I'm very happy to see that others are adopting solar energy!

Communism vs capitalism/democracy

While we in the "free" world like to believe our capitalistic, democratic way is better than the tight-fisted communist approach in, say, China, the situation is not really so clear cut.

The Chinese government has full control to make changes to nearly everything in the country. Yes, sometimes this power is used to awful ends, such as human rights violations and the great firewall of China.

But then, this same unilateral power can lead to great progress, overnight. For example, China plans to add a 5% tax on oil and gas consumption (plus other "raw materials"). The US, in contrast, has for a very long time offered massive tax breaks to the oil companies (this is why the price of our gasoline is horribly low compared to nearly every other country, encouraging us to buy massive, awfully fuel inefficient cars and trucks).

Another example: a few years back, China suddenly required that all cell phone chargers adopt the same standard, so that when people buy new phones they do not need a new charger, thus eliminating a big source of electronics waste. In contrast, in the US with our "new every 2", we have to throw away our chargers every time we upgrade since they rarely interoperate.

A third example is fuel economy standards: China has for a long time had far more stringent requirements than the US.

In contrast, accomplishing these excellent improvements in the US is nearly impossible: each small change requires a tremendous battle through our congress, many members of which are rather blatantly corrupt, accepting all sorts of creative bribes (campaign contributions) from corporations to buy their vote. And corporate influence got even stronger thanks to the recent Supreme Court landmark decision giving corporations much more freedom to influence elections. Finally, congress members, especially these days, seem to vote almost always along party lines rather than for what's actually best for our country's future.

Don't get me wrong: net/net I'm quite happy I live in the US, given all the tradeoffs. But, still, we can and should do better, and copying some of China's recent changes would be a great start.

Monday, July 5, 2010

A better grass

This Pearl's Premium grass looks awesome: mow only once per month; extremely drought tolerant (seldom or never water); thrives without chemicals. It's the polar opposite of the life-support grass that's the gold standard in today's yards.

Maybe, once our yard is in really bad shape, I'll try this seed, mixed with clover. I want our yard to have zero environmental cost, zero carbon footprint. We currently do just that: we haven't watered our lawn in 2 years; no fertilizer or pesticides; no dethatching or mechanical aeration; very rarely raked or mowed; etc. Except it's slowly dying, and weeds are moving in, because we inherited prior life-support grass.

I'd also like near-zero effort on our part (we don't want to hire a lawn care service because we feel that sets a bad example for our kids), and of course to have decent curb appeal. Surely this is not too much to expect?

I wish there were more companies/people developing environmentally friendly alternatives to the standard life-support grass. I'd do a bake-off with these alternatives on different spots in our challenging yard!

Ethernet lightning protection

I'm going to try using this APC ethernet surge suppressor to protect the important devices (desktop, laptop, filer, backups, FIOS) attached to our ethernet network.

Lucene's RAM usage for searching

For fast searching, Lucene loads certain data structures entirely into RAM:

  • The terms dict index requires substantial RAM per indexed term (by default, every 128th unique term), and is loaded when IndexReader is created. This can be a very large amount of RAM for indexes that have an unusually high number of unique terms; to reduce this, you can pass a terms index divisor when opening the reader. For example, passing 2, which loads only every other indexed term, halves the RAM required. But, in tradeoff, seeking to a given term, which is required once for every TermQuery, will become slower as Lucene must do twice as much scanning (on average) to find the term.

  • Field cache, which is used under-the-hood when you sort by a field, takes some amount of per-document RAM depending on the field type (String is by far the worst). This is loaded the first time you sort on that field.

  • Norms, which encode the a-priori document boost computed at indexing time, including length normalization and any boosting the app does, consume 1 byte per field per document used for searching. For example, if your app searches 3 different fields, such as body, title and abstract, then that requires 3 bytes of RAM per document. These are loaded on-demand the first time that field is searched.

  • Deletions, if present, consume 1 bit per doc, created during IndexReader construction.
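As a back-of-envelope aid, the norms and deletions costs above reduce to simple arithmetic (the terms index and FieldCache costs depend too much on the data to estimate this way). These helper methods are mine, not Lucene's:

```java
// Rough RAM estimates for the lazily loaded per-document structures
// described above: norms cost 1 byte per searched field per document,
// and deletions cost 1 bit per document.
public class RamEstimate {
  static long normsBytes(int searchedFields, long numDocs) {
    return searchedFields * numDocs;     // 1 byte x fields x docs
  }

  static long deletionsBytes(long numDocs) {
    return (numDocs + 7) / 8;            // 1 bit per doc, rounded up
  }
}
```

For a 5 million doc index searching 3 fields, that's ~15 MB for norms and ~0.6 MB for deletions; the terms index and FieldCache usually dominate.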

Warming a reader is necessary because of the data structures that are initialized lazily (norms, FieldCache). It's also useful to pre-populate the OS's IO cache with those pages that cover the frequent terms you're searching on.

With flexible indexing, available in Lucene's trunk (4.0-dev), we've made great progress on reducing the RAM required for both the terms dict index and the String index field cache (some details here). We have substantially reduced the number of objects created for these RAM-resident data structures, and switched to representing all character data as UTF-8, not Java's char, which halves the RAM required when the character data is simple ASCII.
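A quick illustration of that halving for ASCII text (class and method names are mine; requires Java 7+ for StandardCharsets):

```java
import java.nio.charset.StandardCharsets;

// For pure-ASCII text, UTF-8 needs 1 byte per character, while Java's
// char (UTF-16 code unit) always takes 2 bytes.
public class Utf8VsChar {
  static int utf8Bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  static int charBytes(String s) {
    return s.length() * 2;  // java.lang.String stores 2-byte chars
  }
}
```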

So, I ran a quick check against a real index, created from the first 5 million documents from the Wikipedia database export. The index has a single segment with no deletions. I initialize a searcher, and then load norms for the body field, and populate the FieldCache for sorting by the title field, using JRE 1.6, 64bit:

  • 3.1-dev requires 674 MB of RAM

  • 4.0-dev requires 179 MB of RAM

That's a 73% reduction on RAM required!

However, there seems to be some performance loss when sorting by a String field, which we are still tracking down.

Note that modern OSs will happily swap out RAM from a process, in order to increase the IO cache. This is rather silly: Lucene loads these specific structures into RAM because we know we will need to randomly access them, a great many times. Other structures, like the postings data, we know we will sweep sequentially once per search, so it's less important that these structures be in RAM. When the OS swaps our RAM out in favor of IO cache, it's reversing this careful separation!

This will of course cause disastrous search latency for Lucene, since many page faults may be incurred on running a given search. On Linux, you can fix this by tuning swappiness down to 0, which I try to do on every Linux computer I touch (most Linux distros default this to a highish number). Windows also has a checkbox, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache, that's likely doing something similar.