Friday, December 3, 2010

Using Finite State Transducers in Lucene

FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output. They also look cool:



That FST maps the sorted words mop, moth, pop, star, stop and top to their ordinal number (0, 1, 2, ...). As you traverse the arcs, you sum up the outputs: stop hits 3 on the s and 1 on the o, so its output ordinal is 4. The outputs can be arbitrary numbers or byte sequences, or combinations, etc. -- it's pluggable.

Essentially, an FST is a SortedMap<ByteSequence,SomeOutput>, if the arcs are in sorted order. With the right representation, it requires far less RAM than other SortedMap implementations, but has a higher CPU cost during lookup. The low memory footprint is vital for Lucene since an index can easily have many millions (sometimes, billions!) of unique terms.

There's a great deal of theory behind FSTs. They generally support the same operations as FSMs (determinize, minimize, union, intersect, etc.). You can also compose them, where the outputs of one FST are intersected with the inputs of the next, resulting in a new FST.

There are some nice general-purpose FST toolkits (OpenFst looks great) that support all these operations, but for Lucene I decided to implement this neat algorithm which incrementally builds up the minimal unweighted FST from pre-sorted inputs. This is a perfect fit for Lucene since we already store all our terms in sorted (unicode) order.

The resulting implementation (currently a patch on LUCENE-2792) is fast and memory efficient: it builds the 9.8 million terms in a 10 million document Wikipedia index in ~8 seconds (on a fast computer), requiring less than 256 MB heap. The resulting FST is 69 MB. It can also build a prefix trie, pruning by how many terms come through each node, with even less memory.
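To give a feel for the API, here is a hedged sketch of building the example FST above and looking up stop. The class names (Builder, PositiveIntOutputs, Util) are from the FST package roughly as it later shipped in Lucene 4.x; the patch on LUCENE-2792 and later releases differ in the exact constructors and signatures, so treat this as an approximation:

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

// Map each pre-sorted term to its ordinal, then look one up.
String[] terms = {"mop", "moth", "pop", "star", "stop", "top"};
PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
IntsRef scratch = new IntsRef();
for (int ord = 0; ord < terms.length; ord++) {
  // inputs must arrive in sorted order
  builder.add(Util.toIntsRef(new BytesRef(terms[ord]), scratch), Long.valueOf(ord));
}
FST<Long> fst = builder.finish();
System.out.println(Util.get(fst, new BytesRef("stop")));  // prints 4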

Note that because addition is commutative, an FST with numeric outputs is not guaranteed to be minimal in my implementation; perhaps if I could generalize the algorithm to a weighted FST instead, which also stores a weight on each arc, that would yield the minimal FST. But I don't expect this will be a problem in practice for Lucene.

In the patch I modified the SimpleText codec, which was loading all terms into a TreeMap mapping the BytesRef term to an int docFreq and long filePointer, to use an FST instead, and all tests pass!

There are lots of other potential places in Lucene where we could use FSTs, since we often need to map the index terms to "something". For example, the terms index maps to a long file position; the field cache maps to ordinals; the terms dictionary maps to codec-specific metadata, etc. We also have multi-term queries (eg Prefix, Wildcard, Fuzzy, Regexp) that need to test a large number of terms, which could instead work directly via intersection with the FST (many apps could easily fit their entire terms dict in RAM as an FST since the format is so compact). The FST could be used for a key/value store. Lots of fun things to try!

Many thanks to Dawid Weiss for helping me iterate on this.

Wednesday, November 3, 2010

Big Fat Fiasco

I just watched this talk by Tom Naughton, describing the mis-steps and bad science over the years that have tricked us all into believing we should avoid fat and cholesterol in our diet when in fact we should be doing the exact opposite!

Tom is an excellent speaker (he's a comedian!), mixing in humor in what is at heart a very sad topic. He details the science that should have led us to conclude that excessive carbs and sugar consumption are in fact the root cause behind heart disease, diabetes and obesity, not fat and cholesterol as we've been incessantly told over the years.

He shows how the "Fat Lazy Slob Theory" (also called "calories in minus calories out") so frequently put forth to explain weight gain is in fact wrong, and that instead the root cause is the biochemistry in your body driving you to eat when your blood sugar is too high.

Tom's documentary movie Fat Head, which is getting great reviews on Amazon, delves into these same topics. I haven't watched it yet but I plan to.

So enjoy your butter, cheese, whole milk, eggs (with yolk!!) and cut back on sugars and carbs. And don't drink soda!

Monday, October 25, 2010

Our medical system is a house of cards

I just came across this great article about meta-researcher Dr. John Ioannidis. Here's the summary:

Much of what medical researchers conclude in their studies is misleading, exaggerated, or flat-out wrong. So why are doctors—to a striking extent—still drawing upon misinformation in their everyday practice? Dr. John Ioannidis has spent his career challenging his peers by exposing their bad science.

The gist is that modern medical research is deeply flawed and biased such that the "conclusions" that you and I eventually read in the news headlines are often false. I especially love his advice for us all:

Ioannidis suggests a simple approach: ignore them all

This is in fact my approach! I have a simple rule: if it tastes good it's good for you. So I eat plenty of fat, salt, sugar, cholesterol, carbs, etc. I love eggs and cheese and I always avoid low-fat or low-cholesterol foods. I get lots of sun and never use sun screen. I drink coffee and beer, daily. I drink lots of water. I get daily exercise, running and walking. And I avoid hand sanitizers like Purell (I believe commonplace dirt/germs are in fact natural and good for you). I strongly believe humans do not need pills to stay healthy. I don't take a daily vitamin. And I'm very healthy!

This short interview between Discover Magazine and Harvard clinician John Abramson echoes the same core problem. Here's a choice quote:

When you look at the highest quality medical studies, the odds that a study will favor the use of a new drug are 5.3 times higher for commercially funded studies than for noncommercially funded studies.

Unfortunately, the medical world has a deep, deep conflict of interest: healthy people do not generate profits. Capitalism is a horrible match to health care.

So, next time your doctor prescribes a fancy new cool-sounding powerful drug like Alevia or Omosia or Nanotomopia or whatever, try to remember that our medical system is really built on a house of cards. Your doctor, let alone you, cannot possibly differentiate what's true from what's false. Don't trust that large triple-blind random controlled trial that supposedly validated this cool new drug. You are the guinea pig! And it's only when these drugs cause all sorts of problems once they are really tested on the population at large that their true colors are revealed.

Sunday, October 17, 2010

Pics from BBQ after Lucene Revolution

I finally pulled the pics off my camera from last week's BBQ after Lucene Revolution in Boston, where much fun was had! See them here. It was awesome to finally meet everyone!

Saturday, October 9, 2010

Fun with flexible indexing

The Lucene Revolution conference just wrapped up yesterday. It was well attended (~300 or so people). It was great fun to hear about all the diverse ways that Lucene and Solr are being used in the real world.

I gave a talk about flexible indexing, coming in the next major release of Lucene (4.0). Slides are here.

Tuesday, October 5, 2010

Lucene's SimpleText codec

Inspired by this question on the Lucene user's list, I created a new codec in Lucene called the SimpleText codec. The best ideas come from the user's lists!

This is of course only available in Lucene's current trunk, to be eventually released as the next major release (4.0). Flexible indexing makes it easy to swap in different codecs to do the actual writing and reading of postings data to/from the index, and we have several fun codecs already available and more on the way...

Unlike all other codecs, which save the postings data in compact binary files, this codec writes all postings to a single human-readable text file, like this:

field contents
term file
doc 0
pos 5
term is
doc 0
pos 1
term second
doc 0
pos 3
term test
doc 0
pos 4
term the
doc 0
pos 2
term this
doc 0
pos 0
END


The codec is read/write, and fully functional. All of Lucene's unit tests pass (slowly) with this codec (which, by the way, is an awesome way to test your own codecs).
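If you want to try it yourself, switching codecs is tiny. A hedged sketch, using the API roughly as it shipped in the released 4.0 (on trunk at the time of this post the codec was instead wired up through a CodecProvider, so the details differ):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Write all postings as human-readable text instead of the default binary format:
Directory dir = new RAMDirectory();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
                                              new StandardAnalyzer(Version.LUCENE_40));
iwc.setCodec(new SimpleTextCodec());
IndexWriter writer = new IndexWriter(dir, iwc);
// ... add documents as usual, close the writer, then open the index files and read them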

Note that the performance of SimpleText is quite poor, as expected! For example, there is no terms index for fast seeking to a specific term, no skipping data for fast seeking within a posting list, some operations require linear scanning, etc. So don't use this one in production!

But it should be useful for transparency, debugging, learning, teaching or anyone who is simply just curious about what exactly Lucene stores in its inverted index.

Tuesday, September 28, 2010

Track your home's live electricity usage with Python

Modern electric meters (the one attached to the outside of your house) emit an IR flash for each watt-hour consumed, through a port on the top of the meter. It turns out it's easy to detect this flash, decode it into "live" power usage, and make cool charts like this:

That's live kW on the Y axis and time on the X axis.

The flash, at least for my meter, appears to have high temporal accuracy, meaning it flashes precisely as you cross each 1.000 Wh. This is great, since it enables accurate, real-time power measurement. For example, when I turn on the lights in my home office, I quickly see the power jump by ~65 watts, and then drop again when I turn the lights off.

This is a fun way to track down which appliances or crazy computers or parasitic drains are using up so much electricity in your house!

I've gone through several designs for this over the years. My last attempt was destroyed when lightning hit my house (there's always a silver lining!), so, I decided to try something new this time and I'm very happy with the approach since it's much simpler, and, moves the pulse detection into Python.

I use a trivial analog circuit to detect the IR pulse, consisting of two 100K resistors in series and an IR photodiode in parallel with one of the resistors. Make sure you get the polarity right on the photodiode otherwise it won't work! Mount the photodiode on your power meter, aligned so that it "sees" each IR pulse. I also added a small (0.01 uF) ceramic capacitor in parallel with the same resistor, to suppress transient EM fields otherwise picked up by the relatively long wires out to my meter.

Finally, I use this simple USB audio adapter to move the problem into the digital domain, connecting the ends of the two series resistors to the mic input and using it for the A/D conversion. This USB audio adapter drives the mic input with ~4.0V bias voltage, which is great since otherwise you'd need an external source.

When there's no pulse, the photo-diode acts roughly like an open switch, meaning there is a fixed 200K resistive load on the mic input. When a pulse happens (mine seem to last for ~10 msec), the photo-diode acts roughly like a closed switch, suddenly dropping the series resistance to 100K. This drop causes a very detectable pulse (strongly negative then strongly positive) on the mic input's voltage, I suspect because there's a capacitor behind the bias voltage (but: I am no analog circuit engineer, so this is speculation!).

You can plug this USB audio adapter into any computer; I use a Sheeva plug computer (delightful device, very low power -- I have three!). I record the digital samples (arecord works well, at a 2 kHz rate) and decode the samples in Python to detect a pulse whenever the value drops below -1000. You can easily compute the live kW based on the time between two adjacent pulses, push this into a database, and build graphs on top of this using Google's visualization APIs (I use dygraphs).
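To make the arithmetic concrete: one flash per watt-hour means that if two pulses arrive 3.6 seconds apart, the house is drawing 1 Wh / 3.6 s = 1,000 W = 1.0 kW; a 36 second gap would mean 100 W.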

My last approach didn't have nearly the temporal accuracy (ie, it smoothed heavily across time), which masks interesting things. For example, now I can tell the difference between a resistive load (the coffee maker, oven, crockpot) and an inductive load (refrigerator compressor, vacuum cleaner) because the inductive load has a huge spike in the beginning, as the motor consumes lots of power trying to spin up.

Friday, September 17, 2010

Lucene's indexing is fast!


Wikipedia periodically exports all of the content on their site, providing a nice corpus for performance testing. I downloaded their most recent English XML export: it uncompresses to a healthy 21 GB of plain text! Then I fully indexed this with Lucene's current trunk (to be 4.0): it took 13 minutes and 9 seconds, or 95.8 GB/hour -- not bad!

Here are the details: I first pre-process the XML file into a flat line file, where each doc's title, date, and body are written to a single line, and then index from this file, so that I measure "pure" indexing cost. Note that a real app would likely have a higher document creation cost here, perhaps having to pull documents from a remote database or from separate files, run filters to extract text from PDFs or MS Office docs, etc. I use Lucene's contrib/benchmark package to do the indexing; here's the alg I used:

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

content.source = org.apache.lucene.benchmark.byTask.feeds.LineDocSource
docs.file = /lucene/enwiki-20100904-pages-articles.txt

doc.stored = true
doc.term.vector = false
doc.tokenized = false
doc.body.stored = false
doc.body.tokenized = true

log.step.AddDoc=10000

directory=FSDirectory
compound=false
ram.flush.mb = 256

work.dir=/lucene/indices/enwiki

content.source.forever = false

CreateIndex

{ "BuildIndex"
[ { "AddDocs" AddDoc > : * ] : 6
- CloseIndex
}

RepSumByPrefRound BuildIndex
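(To run an alg like this, contrib/benchmark's ant build has a run-task target, along the lines of ant run-task -Dtask.alg=wiki-index.alg; the alg file name here is just a placeholder.)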

There is no field truncation taking place, since this is now disabled by default -- every token in every Wikipedia article is being indexed. I tokenize the body field, and don't store it, and don't tokenize the title and date fields, but do store them. I use StandardAnalyzer, and I include the time to close the index, which means IndexWriter waits for any running background merges to complete. The index only has 4 fields -- title, date, body, and docid.

I've done a few things to speed up the indexing:
  • Increase IndexWriter's RAM buffer from the default 16 MB to 256 MB
  • Run with 6 threads
  • Disable compound file format
  • Reuse document/field instances (contrib/benchmark does this by default)
Lucene's wiki describes additional steps you can take to speed up indexing.

Both the source lines file and index are on an Intel X25-M SSD, and I'm running it on a modern machine, with dual Xeon X5680s, overclocked to 4.0 GHz, with 12 GB RAM, running Fedora Linux. Java is 64bit 1.6.0_21-b06, and I run as java -server -Xmx2g -Xms2g. I could certainly give it more RAM, but it's not really needed. The resulting index is 6.9 GB.

Out of curiosity, I made a small change to contrib/benchmark, to print the ingest rate over time. It looks like this (over a 100-second window):



Note that a large part (slightly over half!) of the time, the ingest rate is 0; this is not good! This happens because the flushing process, which writes a new segment when the RAM buffer is full, is single-threaded, and, blocks all indexing while it's running. This is a known issue, and is actively being addressed under LUCENE-2324.

Flushing is CPU intensive -- the decode and reencode of the great many vInts is costly. Computers usually have big write caches these days, so the IO shouldn't be a bottleneck. With LUCENE-2324, each indexing thread state will flush its own segment, privately, which will allow us to make full use of concurrency across both the CPUs and the IO system. Once this is fixed, Lucene should be able to make full use of the hardware, i.e. fully saturate either concurrent CPU or concurrent IO such that whichever is the bottleneck in your context gates your ingest rate. Then maybe we can double this already fast ingest rate!

Monday, September 13, 2010

Fast search filters using flex

A filter in Lucene is a bit set that restricts the search space for any query; you pass it into IndexSearcher's search method. It's effective for a number of use cases, such as document security, index partitions, facet drill-down, etc.

To apply a filter, Lucene must compute the intersection of the documents matching the query against the documents allowed by the filter. Today, we do that in IndexSearcher like this:

while (true) {
  if (scorerDoc == filterDoc) {
    // Check if scorer has exhausted, only before collecting.
    if (scorerDoc == DocIdSetIterator.NO_MORE_DOCS) {
      break;
    }
    collector.collect(scorerDoc);
    filterDoc = filterIter.nextDoc();
    scorerDoc = scorer.advance(filterDoc);
  } else if (scorerDoc > filterDoc) {
    filterDoc = filterIter.advance(scorerDoc);
  } else {
    scorerDoc = scorer.advance(filterDoc);
  }
}

We call this the "leapfrog approach": the query and the filter take turns trying to advance to each other's next matching document, often jumping past the target document. When both land on the same document, it's collected.

Unfortunately, for various reasons this implementation is inefficient (these are spelled out more in LUCENE-1536):
  1. The advance method for most queries is costly.
  2. The advance method for most filters is usually cheap.
  3. If the number of documents matching the query is far higher than
    the number matching the filter, or vice versa, it's better to drive
    the matching by whichever is more restrictive.
  4. If the filter supports fast random access, and is not super
    sparse, it's better to apply it during postings enumeration, like
    deleted docs.
  5. Query scorers don't have a random access API, only .advance(),
    which does unnecessary extra work .next()'ing to the next matching
    document.
To fix this correctly, Lucene really needs an optimization stage, much like a modern database, which looks at the type and structure of the query and filter, as well as estimates of how many documents will match, and then picks an appropriate execution path. Likely this will be coupled with source code specialization/generation (LUCENE-1594) to write dedicated Java code to execute the chosen path. Some day we'll get to that point!

Until then, there is a simple way to get a large speedup in many cases, addressing the 4th issue above. Prior to flexible indexing, when you obtained the postings enumeration for documents matching a given term, Lucene would silently filter out deleted documents. With flexible indexing, the API now allows you to pass in a bit set marking the documents to skip. Normally you'd pass in the IndexReader's deleted docs. But, with a simple subclass of FilterIndexReader, it's possible to use any filter as the documents to skip.

To test this, I created a simple class, CachedFilterIndexReader (I'll attach it to LUCENE-1536). You pass it an existing IndexReader, plus a Filter, and it creates an IndexReader that filters out both deleted documents and documents that don't match the provided filter. Basically, it compiles the IndexReader's deleted docs (if any), and the negation of the incoming filter, into a single cached bit set, and then passes that bit set as the skipDocs whenever postings are requested. You can then create an IndexSearcher from this reader, and all searches against it will be filtered according to the filter you provided.

This is just a prototype, and has certain limitations, eg it doesn't implement reopen, it's slow to build up its cached filter, etc.
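The heart of the class is just building that combined bit set. Here is a hedged sketch, not the actual CachedFilterIndexReader code from LUCENE-1536; it uses the 3.x-style deletion check, assumes the filter returns a non-null DocIdSet, and the flex method that consumes these bits was later renamed (skipDocs became liveDocs, with inverted meaning):

// Docs to skip = everything the filter rejects, plus anything already deleted.
int maxDoc = reader.maxDoc();
OpenBitSet skipDocs = new OpenBitSet(maxDoc);
skipDocs.set(0, maxDoc);                        // start by skipping everything
DocIdSetIterator accepted = filter.getDocIdSet(reader).iterator();
int doc;
while ((doc = accepted.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  skipDocs.clear(doc);                          // the filter accepts this doc: keep it
}
if (reader.hasDeletions()) {
  for (int d = 0; d < maxDoc; d++) {
    if (reader.isDeleted(d)) {
      skipDocs.set(d);                          // fold deletions back in
    }
  }
}
// A FilterIndexReader subclass then returns skipDocs wherever postings
// enumeration asks for the documents to skip.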

Still, it works very well! I tested it on a 10M Wikipedia index, with a random filter accepting 50% of the documents:

Query              QPS Default   QPS Flex   % change
united~0.7         19.95         19.25      -3.5%
un*d               43.19         52.21      20.9%
unit*              21.53         30.52      41.8%
"united states"    6.12          8.74       42.9%
+united +states    9.68          14.23      47.0%
united states      7.71          14.56      88.9%
states             15.73         36.05      129.2%

I'm not sure why the fuzzy query got a bit slower, but the speedups on the other queries are awesome. However, this approach is actually slower if the filter is very sparse. To test this, I ran just the TermQuery ("states"), against different filter densities:

The cutover, for TermQuery at least, is somewhere around 1.1%, meaning if the filter accepts more than 1.1% of the index, it's best to use the CachedFilterIndexReader class; otherwise it's best to use Lucene's current implementation.

Thanks to this new flex API, until we can fix Lucene to properly optimize for filter and query intersection, this class gives you a viable, fully external, means of massive speedups for non-sparse filters!

Tuesday, August 24, 2010

Here comes another bubble

Hilarious. It's a spoof of Billy Joel's catchy We Didn't Start the Fire. Scary that this was posted more than two and a half years ago!

Saturday, August 7, 2010

Do babies really sleep?

Those of you with kids have undoubtedly discovered that sometimes they sleep with their eyes half open.

The first time it happens it's somewhat alarming, but, then, you get used to it and realize it's "normal". It probably just means their eyelids haven't quite closed yet.

Well, my one-year-old did this, yesterday, so I seized the opportunity to do an experiment!! I wanted to test just how asleep she was, so, I smiled like crazy, made funny faces, scrunched my nose, stuck my tongue out, at her half open eye.

And get this: she laughed! In her sleep. I repeated the test 3 times and each time she laughed. Yet, she was most definitely "asleep" -- the eyelid eventually closed and she didn't wake up for another hour.

I guess enough of the brain remains active to process what the eye is seeing even when the rest of the brain is sleeping? Wild!

Before you think I'm being unfair, she also conducts sleep experiments on us! Last weekend the whole family went camping. It was bed time and, as usual, we were all exhausted except our one-year-old. There was enough low light so she could see, so, as we were all dropping off one by one, she took turns running back and forth and pulling hard on our chins to wake us up again! Probably, from her perspective, this was like a horror movie, where her family is one by one dropping into comas and it's her job to prevent the disaster!

Kids are great fun!

Monday, August 2, 2010

Lucene performance with the PForDelta codec

Today, to encode the postings (docs, freqs, positions) in the index, Lucene uses a variable byte format where each integer is individually encoded as 1-5 bytes.

While this is wonderfully simple, it requires an if statement on every byte during decode, which is very costly since the CPU cannot easily predict the branch outcome.

To reduce the number of branches per int decode, you must switch to decoding groups of ints at once. This paper describes a few such encoders (PForDelta, Rice Coding, Simple9, Simple16). Chapter 6 (Index Compression) of Information Retrieval: Implementing and Evaluating Search Engines goes into great detail on these and others, including an addendum showing results for Google's Group VarInt encoder.

The IntBlock codec is designed to make it easy to experiment with these sorts of different int encoders. It does the "hard part" of enabling Lucene to operate on a set of ints at once, which is somewhat tricky as the seek points (in the terms dict and the skipping lists) must now encode both the file-pointer where the block starts as well as the index (starting int) into the block.

There is already a low-level initial implementation of frame-of-reference (FOR) and patched frame-of-reference (PFOR) on LUCENE-1410, as well as Simple9/16 on LUCENE-2189, thanks to Paul Elschot and Renaud Delbru.

FOR is a simple encoding: it takes each fixed block of ints and stores them all as packed ints, where each value gets N bits, set by the maximum int in the block. So say our block size is 8, and we have these ints:

1 7 3 5 6 2 2 5

we'd only need 3 bits per value. But, FOR has a clear weakness: if you have a single big int mixed in with a bunch of small ones, then you waste bits. For example if we have these ints:

1 7 3 5 293 2 2 5

then FOR must use 9 bits for all values (because of 293), despite the fact that the other values only needed 3 bits each. How much this hurts in practice remains to be seen.

PFOR fixes this by marking such large numbers as exceptions, which are then "patched" after the initial decode. So for the above example, we would still use 3 bits per-value, but we'd mark that the 293 value was an "exception" and it would be separately encoded at the end (with more bits), and then "patched" back in at decode time. The above paper has the details.
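Here is a toy illustration of the bit-width math, just to show the idea; Lucene's actual FOR/PFOR code on LUCENE-1410 packs the values into blocks and handles the exception encoding properly:

// For one block, FOR uses as many bits per value as the largest value needs.
int[] block = {1, 7, 3, 5, 293, 2, 2, 5};
int max = 0;
for (int v : block) {
  max = Math.max(max, v);
}
int bitsPerValue = 32 - Integer.numberOfLeadingZeros(Math.max(max, 1));
System.out.println(bitsPerValue + " bits per value, "
    + (bitsPerValue * block.length + 7) / 8 + " bytes packed");
// {1, 7, 3, 5, 6, 2, 2, 5}   -> 3 bits per value
// {1, 7, 3, 5, 293, 2, 2, 5} -> 9 bits per value under FOR; PFOR would keep
//                               3 bits and patch 293 in separately as an exception.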

Note that block-based codecs are not always a good fit, since in order to access even one int within the block, they [typically] must decode the full block. This can be a high cost if you frequently need to decode just an int or two, such as with a primary key field. Pulsing codec is a better fit for such fields.

During testing, I found a few silly performance problems in the IntBlock codec, which I've hacked around for now (I'll post my patch on LUCENE-1410); the hacks are nowhere near committable, but are functioning correctly (identical search results for the queries I'm testing). I also made some more minor optimizations to the FOR/PFOR implementation. I'd like to get the IntBlock codec to the point where anyone can easily tie in a fixed or variable block size int encoding algorithm and run their own tests.

I indexed the first 10 million Wikipedia ~1KB sized docs. The resulting index sizes were:

Codec      Size      % Change
Standard   3.50 GB
FOR        3.96 GB   13.3% bigger
PFOR       3.87 GB   11.1% bigger

The size increases of both FOR and PFOR are fairly low. PFOR is smaller than FOR, as expected, but it's surprising that it's only a little bit smaller. Though, this is a function of where you draw the line for the exception cases and will be corpus dependent. Still, I'm encouraged by how little additional space FOR requires over the Standard codec.

I ran a set of queries and measured the best time (of 40 runs) for each, across 4 threads. Here are the results for FOR:

Query              QPS Standard   QPS FOR   % Change
united~0.6         5.59           5.08      -9.0%
"united states"    11.46          11.22     -2.1%
united~0.7         18.33          19.23     4.9%
united states      15.53          18.66     20.1%
uni*d              34.25          43.37     26.6%
unit*              31.27          41.07     31.3%
states             55.72          75.82     36.1%
+united +states    15.17          21.43     41.2%

and for PFOR:

Query              QPS Standard   QPS PFOR   % Change
united~0.6         5.59           5.02       -10.1%
"united states"    11.46          10.88      -5.1%
united~0.7         18.33          19.00      3.7%
united states      15.53          18.11      16.6%
uni*d              34.25          43.71      27.6%
states             55.72          71.39      28.1%
unit*              31.27          41.42      32.5%
+united +states    15.17          21.25      40.1%

They both show good gains for most queries; FOR has a slight edge since it doesn't have to decode exceptions. The united~0.6 (max edit distance = 2) is slower, likely because a good number of the 519 terms it expands to have low frequency, and so the overhead of a full block decode is hurting performance. The PhraseQuery also got a little slower, probably because it uses non-block skipping during searching. The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints. Likely a hybrid codec for Lucene, using FOR or FFOR for high frequency terms, Standard for medium frequency, and Pulsing for very low frequency, would perform best overall.

I ran one additional test, this time using FOR on MMapDirectory (instead of NIOFSDirectory used above), passing an IntBuffer from the mapped pages directly to the FOR decoder:

Query              QPS Standard   QPS FOR   % Change
united~0.6         6.00           5.28      -12.1%
united~0.7         18.48          20.53     11.1%
"united states"    10.47          11.70     11.8%
united states      15.21          20.25     33.2%
uni*d              33.16          47.59     43.5%
unit*              29.71          45.14     51.9%
+united +states    15.14          23.65     56.2%
states             52.30          88.66     69.5%

The results are generally even better for two reasons. First, for some reason the Standard codec slows down a bit with MMapDirectory. Second, the FOR codec speeds up substantially, because I'm able (hacked up though it is) to do a no-copy decode with MMapDirectory.

Remember that these results are very preliminary, based on a patch with many non-committable hacks (eg deleted docs are ignored!), that doesn't pass all Lucene's unit tests, etc. There are also still further optimizations to explore, and lots of work remains. So it's not clear where the final numbers will be, but these initial results are very encouraging!

If you have any other algorithms to try for encoding ints, please pipe up! It's very easy to have Lucene generate a few billion ints for you to encode/decode if you want to run standalone tests.

Wednesday, July 14, 2010

Moving readVInt to C

By far the hottest spot in Lucene during searching is the method (DataInput.readVInt) that decodes Lucene's variable-length integer representation (vInt). This method is called an insane number of times, while iterating the postings lists (docs, freqs, positions), during query execution.

This representation is not CPU friendly: for every single byte read, the CPU hits a hard-to-predict if statement (testing the high bit). Alternative block-based formats do a good job reducing branches while decoding, such as these encodings (PFOR-DELTA is currently an experimental codec attached as a patch on LUCENE-1410) or Google's Group Varint.
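For reference, the decode loop is basically this (simplified from DataInput.readVInt); the (b & 0x80) test on every byte is the hard-to-predict branch:

// The low 7 bits of each byte carry data; the high bit says another byte follows.
public int readVInt() throws IOException {
  byte b = readByte();
  int i = b & 0x7F;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = readByte();
    i |= (b & 0x7F) << shift;
  }
  return i;
}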

Anyway, I decided to test whether porting readVInt to C would give us a performance boost. I knew the overhead of JNI had been highish in the past, but I was hoping modern JREs had improved this.

Unfortunately, they haven't: I see a ~10%-~40% slowdown, depending on the query. Likely hotspot's inability to inline native methods is also a factor here. Perhaps, if/when we switch to a block-based codec, we'll see a gain with a native implementation, since the amortized overhead of invoking JNI will be lower.

Tuesday, July 13, 2010

Apple censorship, again

It's sickening that Apple thinks it's OK to censor (remove) posts from their forums. It's not the first time they've done this, either.

I consider it entrapment -- Apple hosts these forums, as the obvious place where we all can go to discuss any issues we have. As with any forum, there's a natural implicit assumption of non-interference, much like I expect my ISP not to block my visits to random web-sites, my email provider not to block certain emails, and the local coffee shop not to disallow discussions about certain topics.

But then, suddenly, Apple acts like China: they censor that which they disagree with.

C'mon Apple!

You should be above such draconian behavior. If you disagree with what's being said, play fair! Come out and discuss the topic, explain your viewpoint, convince us.

Saturday, July 10, 2010

What motivates us

Great video, from Daniel Pink, animated by RSA, summarizing his book digging into what motivates people to do things. I find it very accurate about myself, and I expect especially people working in open-source projects will resonate strongly with the surprising findings.

Encoding videos for Apple iPad/iPod, using mencoder

After much searching, poking and iterating, I found a magical mencoder command-line that encodes videos in the right format for the iPad. Debugging this is hard because, if you get something wrong, iTunes simply silently fails to open the video file, giving no specific feedback as to what's wrong. I felt like a monkey on a typewriter...

So here are the options, broken by group. Set the output format to MP4:

-of lavf -lavfopts format=mp4

Scale the video to width 1280, height to match the aspect ratio (if your source video is low resolution, eg DVD, leave this off; you can also crop (-vf crop=W:H:X:Y; run mplayer -vf cropdetect,scale to have it guess the crop for you), or de-interlace (I like -vf pp=lb), or any other video filter):

-vf scale=1280:-3

Use x264 codec for video, with a zillion options (tweak that crf=XX if you want higher quality; lower values of XX give higher quality, but take more space):

-ovc x264 -x264encopts crf=28:vbv_maxrate=1500:nocabac:global_header:frameref=3:threads=auto:bframes=0:subq=6:mixed-refs=0:weightb=0:8x8dct=1:me=umh:partitions=all:qp_step=4:qcomp=0.7:trellis=1:direct_pred=auto

Use faac codec for audio:

-oac faac -faacopts br=160:mpeg=4:object=2:raw -channels 2 -srate 48000
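Putting the groups together, the full command ends up looking like this (the input and output file names are placeholders):

mencoder input.avi \
  -of lavf -lavfopts format=mp4 \
  -vf scale=1280:-3 \
  -ovc x264 -x264encopts crf=28:vbv_maxrate=1500:nocabac:global_header:frameref=3:threads=auto:bframes=0:subq=6:mixed-refs=0:weightb=0:8x8dct=1:me=umh:partitions=all:qp_step=4:qcomp=0.7:trellis=1:direct_pred=auto \
  -oac faac -faacopts br=160:mpeg=4:object=2:raw -channels 2 -srate 48000 \
  -o output.mp4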

I'm sure many other command-line combinations would work; this was just the first combo I found that works.

If you need to encode for the iPod Touch, just change the -vf scale=1280:-3 to -vf scale=640:-3 (at least this works for recent iPod Touch models).

You can also use the nifty mp4art tool to insert cover art after the encode finishes; this lets you set the image for the movie when browsing on the iPad/iPod or in iTunes, which is vital if young kids will be launching the movies! I use PNG images, which work fine.

Friday, July 9, 2010

Old Orchard Beach

We spent the past half week in Old Orchard Beach, Maine (US), in the Crest Motel. It's a great place! The rooms are clean, the beach is right outside, and it's a short walk to Palace Playland. They have an indoor pool & hot tub, and I'm happy to report that they do not discriminate against kids. The whole area is very family friendly, and the kids had a blast!

But what I loved most was... while we were staying there, they happened to be installing Solar hot water panels! At first we saw random holes drilled in the floor and ceiling, which made all of us very curious. Then the next day these were filled in with insulated pipes. Finally, this morning, we saw the solar panels themselves, being carried up and installed on the roof. The location is excellent -- they have a large roof, and no tall trees anywhere nearby. It looked like a high capacity installation, maybe ~250 square feet. They expect to fully generate all hot water in the summer, and also a sizable portion of the (air) heat in the winter.

I have a small (600W nominal, off-grid) solar electric installation in my house, self-installed. In the summer, it easily offsets the electricity used by our home theater. I involved the kids all throughout the installation, to teach them how important it is to find renewable sources for everything. And whenever they or their friends watch movies, I remind them that it's all powered by sunshine electricity, which of course leads to a great curious discussion and more teachable moments.

So I took this chance to (again) hammer home this vital lesson. And, I'm very happy to see that others are adopting Solar energy!

Communism vs capitalism/democracy

While we in the "free" world like to believe our capitalistic, democratic way is better than the tight-fisted communist approach in, say, China, the situation is not really so clear cut.

The Chinese government has full control to make changes to nearly everything in the country. Yes, sometimes this power is used to awful ends, such as human rights violations and the great firewall of China.

But then, this same unilateral power can lead to great progress, overnight. For example, China plans to add a 5% tax on oil and gas consumption (plus other "raw materials"). The US, in contrast, has for a very long time offered massive tax breaks to the oil companies (this is why the price of our gasoline is horribly low compared to nearly every other country, encouraging us to buy massive, awfully fuel inefficient cars and trucks).

Another example: a few years back, China suddenly required that all cell phone chargers adopt the same standard, so that when people buy new phones they do not need a new charger, thus eliminating a big source of electronics waste. In contrast, in the US with our "new every 2", we have to throw away our chargers every time we upgrade since they rarely interoperate.

A third example is fuel economy standards: China has for a long time had far more stringent requirements than the US.

In contrast, accomplishing these excellent improvements in the US is nearly impossible: each small change requires a tremendous battle through our congress, many members of which are rather blatantly corrupt, accepting all sorts of creative bribes (campaign contributions) from corporations to buy their vote. And corporate influence got even stronger thanks to the recent Supreme Court landmark decision giving corporations much more freedom to influence elections. Finally, congress members, especially these days, seem to vote almost always along party lines rather than for what's actually best for our country's future.

Don't get me wrong: net/net I'm quite happy I live in the US, given all the tradeoffs. But, still, we can and should do better, and copying some of China's recent changes would be a great start.

Monday, July 5, 2010

A better grass

This Pearl's Premium grass looks awesome: mow only once per month; extremely drought tolerant (seldom or never water); thrives without chemicals. It's the polar opposite of the life-support grass that's the gold standard in today's yards.

Maybe, once our yard is in really bad shape, I'll try this seed, mixed with clover. I want our yard to have zero environmental cost, zero carbon footprint. We currently have just that: we haven't watered our lawn in 2 years; no fertilizer, pesticides; no dethatching, mechanical aeration; very rarely raked, mowed; etc., except, it's slowly dying, and weeds are moving in, because we inherited prior life-support grass.

I'd also like near-zero effort on our part (we don't want to hire a lawn care service because we feel that sets a bad example for our kids), and of course to have decent curb appeal. Surely this is not too much to expect?

I wish there were more companies/people developing environmentally friendly alternatives to the standard life-support grass. I'd do a bake-off with these alternatives on different spots in our challenging yard!

Ethernet lightning protection

I'm going to try using this APC ethernet surge suppressor to protect the important devices (desktop, laptop, filer, backups, FIOS) attached to our ethernet network.

Lucene's RAM usage for searching

For fast searching, Lucene loads certain data structures entirely into RAM:

  • The terms dict index requires substantial RAM per indexed term (by default, every 128th unique term), and is loaded when IndexReader is created. This can be a very large amount of RAM for indexes that have an unusually high number of unique terms; to reduce this, you can pass a terms index divisor when opening the reader. For example, passing 2, which loads only every other indexed term, halves the RAM required. But, in tradeoff, seeking to a given term, which is required once for every TermQuery, will become slower as Lucene must do twice as much scanning (on average) to find the term.

  • Field cache, which is used under-the-hood when you sort by a field, takes some amount of per-document RAM depending on the field type (String is by far the worst). This is loaded the first time you sort on that field.

  • Norms, which encode the a-priori document boost computed at indexing time, including length normalization and any boosting the app does, consume 1 byte per field per document used for searching. For example, if your app searches 3 different fields, such as body, title and abstract, then that requires 3 bytes of RAM per document. These are loaded on-demand the first time that field is searched.

  • Deletions, if present, consume 1 bit per doc, created during IndexReader construction.

Warming a reader is necessary because of the data structures that are initialized lazily (norms, FieldCache). It's also useful to pre-populate the OS's IO cache with those pages that cover the frequent terms you're searching on.
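For example, a warming call might look like this -- a hedged sketch against the 3.x API, where the field names are made up and the warming searches should mirror whatever fields and sorts your real traffic uses:

// Warm a freshly opened reader before exposing it to user queries:
// the first search loads norms for "body"; the sorted search populates
// the FieldCache entry for "title".
IndexSearcher searcher = new IndexSearcher(reader);
Query warmup = new TermQuery(new Term("body", "lucene"));
searcher.search(warmup, 10);
searcher.search(warmup, null, 10, new Sort(new SortField("title", SortField.STRING)));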

With flexible indexing, available in Lucene's trunk (4.0-dev), we've made great progress on reducing the RAM required for both the terms dict index and the String index field cache (some details here). We have substantially reduced the number of objects created for these RAM resident data structures, and switched to representing all character data as UTF8, not java's char, which halves the RAM required when the character data is simple ascii.

So, I ran a quick check against a real index, created from the first 5 million documents from the Wikipedia database export. The index has a single segment with no deletions. I initialize a searcher, and then load norms for the body field, and populate the FieldCache for sorting by the title field, using JRE 1.6, 64bit:

  • 3.1-dev requires 674 MB of RAM

  • 4.0-dev requires 179 MB of RAM

That's a 73% reduction on RAM required!

However, there seems to be some performance loss when sorting by a String field, which we are still tracking down.

Note that modern OSs will happily swap out RAM from a process, in order to increase the IO cache. This is rather silly: Lucene loads these specific structures into RAM because we know we will need to randomly access them, a great many times. Other structures, like the postings data, we know we will sweep sequentially once per search, so it's less important that these structures be in RAM. When the OS swaps our RAM out in favor of IO cache, it's reversing this careful separation!

This will of course cause disastrous search latency for Lucene, since many page faults may be incurred on running a given search. On Linux, you can fix this by tuning swappiness down to 0, which I try to do on every Linux computer I touch (most Linux distros default this to a highish number). Windows also has a checkbox, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache, that's likely doing something similar.

Wednesday, June 30, 2010

Our house was hit by lightning

Believe it or not, our house was struck by lightning! It happened a few days ago, as a cold front swept over, bringing with it some intense but short-lived thunderstorms.

I was working on my computer when I heard a loud POP sound of a spark, behind me, in the utility closet. At the same time my computer went blank, and there was an insanely loud thunder clap. My poor son was on the toilet at the time and told me he was so startled that he jumped high in the air and almost fell in! Fortunately, none of us were hurt.

The strike destroyed our central 16-port gigabit ethernet switch, 3 out of 4 LAN ports on my FIOS NAT box, a couple power supplies and one netcam. It also fried the device I use to read the electrical (charging, inverting) data from the solar panels in my back yard, but the solar panels themselves, including the thick copper ground wires designed to "guide" lightning into the ground and away from the house, were all fine, as well as the charger, inverter and batteries. My 1-Wire network, which I use to measure various indoor & outdoor temperatures, is also still dead. My wife's computer immediately shut down and rebooted, several times (spooky), but apparently unharmed. My computer seemed to lose both ethernet ports, but then after much rebooting and testing plug-in ethernet cards, they came back to life.

A large tree branch in our neighbor's yard fell down; the neighbors across the street called the fire department; yet another neighbor saw bright sparks in his basement and also lost a bunch of electronics.

Almost certainly this was not a direct strike for us; otherwise things would have been vaporized instead of simply dead. Instead, the sudden immense electro-magnetic field created at the direct strike radiates outward, creating the local equivalent of an EMP bomb. This EMF then induces high voltage and current in any wires it crosses; the closer you are to the direct strike, and the longer your wires are, the more damaging the induced voltage and current is. In my case, apparently, the extensive network of ethernet wires in my house caused most of the damage. This is a good reason to use WiFi!

I will now buy myself something to try to prevent this from happening again.

Lightning is crazy stuff. The process the lightning goes through in seeking the path through which it will dump insane amounts of current is fascinating. National Geographic has a great facts page; for example, talking on your land-line telephone is the leading cause of lightning injuries inside the home. I suspect we may have been hit by positive lightning, because we seemed to be hit, out of the blue, well before the storm itself seemed to arrive.

Lightning strikes are apparently rather common; in just my immediate family this has now happened twice to me, and once each to my brother, father and grandparents!

Tuesday, June 29, 2010

Lucene in Action 2nd Edition is done!

Lucene in Action, 2nd Edition, is finally done: the eBook is available now, and the print book should be released on July 8th!

The source code that goes along with the book is freely available and free to use (Apache Software License 2.0), and there are two free chapters (Chapter 1, and Chapter 3). There is also a free green paper excerpted from the book, Hot Backups with Lucene, as well as the section describing CLucene, the C/C++ port of Lucene.

Writing is the best way to learn something -- because of this book I've learned all sorts of nooks and crannies in Lucene that I otherwise would not have explored for quite some time.

Monday, June 21, 2010

Beware String.substring's memory usage

I'm trying to build up a test index for testing Lucene's sort performance, to track down a regression in String sorting performance between 3.x and 4.0, apparently from our packed ints cutover.

To do this, I want to use the unique title values from Wikipedia's full database export.

So I made a simple task in Lucene's contrib/benchmark framework to hold onto the first 1M titles it hits. Titles tend to be small, say a worst-case average of 100 characters per document, so worst-case RAM would be ~200 MB or so, right?

Wrong!

It turns out, in Java, when you call String's substring method, the resulting String returned to you keeps a reference to the original String, so the original String can never be GC'd if you hold onto the substring. Java can do this "optimization" because Strings are immutable.

For me, this "optimization" is a disaster: the title is obtained by getting the substring of a large string (derived from a line-doc file) that holds the full body text as well! Instead of ~200 characters per unique title I was looking at ~25K characters! Ugh.

Fortunately, the workaround is simple -- use the String constructor that takes another String. This forces a private copy.
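Here's the gist (this is the JDK behavior at the time; later JDKs changed substring to copy, and readHugeLineDoc is a made-up stand-in for however you obtain the big line):

String line = readHugeLineDoc();                // ~25K chars: title plus full body text
String titleLeaky = line.substring(0, 50);      // still references line's huge char[]
String titleCopied = new String(line.substring(0, 50));  // private copy; line can now be GC'd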

I imagine for many cases this "optimization" is very worthwhile. If you have a large original string, and pull many substrings from it, and then discard all of those substrings and the original string, you should see nice gains from this "optimization".

There is a longstanding bug opened for this; likely it will never be fixed. Really, GC should be empowered to discard the original string and keep only the substring. Or perhaps substring should have some heuristics as to when it's dangerous to keep the reference to the original String.

Sunday, June 20, 2010

Geek Dad

We've had this great family ritual, for a few years now: at the start of every summer we pitch a tent in our back yard! We leave it there all summer, and the kids have great fun, with neighbors and friends, playing in it, hiding from the rain, etc.

We also pick a few nights to sleep in the tent, which feels almost as real as camping, yet you have the delightful freedom of taking a short walk back to the house if you forgot something.

So, last night we slept in the tent. But, this year I brought our new iPad with us. The kids took turns playing a few games; one of their favorites is Gomi HD, a game I also love for its not-so-subtle message that we humans are screwing up the planet and it's up to the kids to fix it. Then we watched Shrek 2, which I had previously encoded for the iPad (using Handbrake). Finally, the kids, one by one, fell asleep.

Then, in the middle of the night, I woke up and heard rain hitting the tent. I was worried that it could turn into a big storm (in past years our tent has been literally flattened by passing thunderstorms, once when we were inside!), so, I turned on the iPad and loaded the local NexRAD radar, and confirmed that in fact it was just a small passing cell: phew!

Finally, I woke up, and was ready to get up, so I used the iPad again, this time to start the coffee maker. I have a simple Web server, written in Python, that exposes an HTML interface to various lights and appliances controlled via Insteon, including the coffee maker. It's vital that I minimize the time from when I first stand up in the morning to when I consume my coffee!

Yes, I'm a geek Dad.

Monday, June 14, 2010

Lucene and fadvise/madvise

While indexing, Lucene periodically merges multiple segments in the index into a single larger segment. This keeps the number of segments relatively contained (important for search performance), and also reclaims disk space for any deleted docs on those segments.

However, it has a well known problem: the merging process evicts pages from the OS's buffer cache. The eviction is ~2X the size of the merge, or ~3X if you are using compound file.

If the machine is dedicated to indexing, this usually isn't a problem; but on a machine that's also searching, this can be catastrophic as it "unwarms" your warmed reader. Users will suddenly experience long delays when searching. And because a large merge can take hours, this can mean hours of suddenly poor search performance.

So why hasn't this known issue been fixed yet? Because Java, unfortunately, does not expose access to the low-level APIs (posix_fadvise, posix_madvise) that would let us fix this. It's not even clear whether NIO.2 (in Java 7) will expose these.

On the Lucene dev list we've long assumed that these OS-level functions should fix the issue, if only we could access them.

So I decided to make a quick and dirty test to confirm this, using a small JNI extension.

I created a big-ish (~7.7G) multi-segment Wikipedia index, and then ran a set of ~2900 queries against this index, over and over, letting it warm up the buffer cache. Looking at /proc/meminfo (on Linux) I can see that the queries require ~1.4GB of hot RAM in the buffer cache (this is a CentOS Linux box with 3G RAM; the index is on a "normal" SATA hard drive). Finally, in a separate JRE, I opened an IndexWriter and called optimize on the index.

I ran this on trunk (4.0-dev), first, and confirmed that after a short while, the search performance indeed plummets (by a factor of ~35), as expected. RAM is much faster than hard drives!

Next, I modified Lucene to call posix_fadvise with the NOREUSE flag; from the man page, this flag looks perfect:

Specifies that the application expects to access the specified data once and then not reuse it thereafter.

I re-ran the test and.... nothing changed! Exactly the same slowdown. So I did some digging, and found Linux's source code for posix_fadvise. If you look closely you'll see that NOREUSE is a no-op! Ugh.

This is really quite awful. Besides Lucene, I can imagine a number of other apps that really should use this flag. For example, when mencoder slowly reads a 50 GB bluray movie, and writes a 5 GB H.264 file, you don't want any of those bytes to pollute your buffer cache. Same thing for rsync, backup programs, software up-to-date checkers, desktop search tools, etc. Of all the flags, this one seems like the most important to get right! It's possible other OSs do the right thing; I haven't tested.

So what to do?

One approach is to forcefully free the pages, using the DONTNEED flag. This will drop the specified pages from the buffer cache. But there's a serious problem: the search process is using certain pages in these files! So you must only drop those pages that the merge process, alone, had pulled in. You can use the mincore function to query for those pages that are already cached, so you know which ones not to drop. A neat patch for rsync took exactly this approach. The problem with this is that mincore provides only a snapshot, so you'd have to call it many times while merging to try to minimize discarding pages that had been recently cached for searching.

We should not have to resort to such silly hacks!

Another approach is to switch to memory-mapped IO, using Lucene's MMapDirectory, and then use madvise. The SEQUENTIAL option looks promising from the man page:

Expect page references in sequential order. (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)

Looking through the Linux sources, it looks like the SEQUENTIAL option is at least not a no-op; that setting has some influence over how pages are evicted.

So I tested that, but, alas, the search performance still plummets. No go!

Yet another approach is to bypass all OS caching entirely, only when merging, by using the Linux-specific O_DIRECT flag. Merge performance will definitely take a hit, since the OS is no longer doing readahead nor write caching, and every single IO request must hit the disk while you wait, but for many apps this could be a good tradeoff.

So I created a prototype Directory implementation, a variant of DirectNIOFSDirectory (currently a patch on LUCENE-2056), that opened all files (input and output) with O_DIRECT (using jni). It's a little messy because all IO must be "aligned" by certain rules (I followed the rules for 2.6.* kernels).

Finally, this solution worked great! Search performance was unchanged all through the optimize call, including building the CFS at the end. Third time's a charm!

However, the optimize call slowed down from 1336 to 1680 seconds (26% slower). This could likely be reduced by further increasing the buffer sizes (I used a 1 MB buffer for each IndexInput and IndexOutput, which is already large), or possibly creating our own readahead / write cache scheme.

We really should not have to resort to such silly hacks!

Sunday, June 6, 2010

Finding the lead in your house

It's known that exposure to lead, even in tiny amounts, causes loss of intelligence and other nasty problems in children.

We have now phased out lead paint, leaded gasoline, and lead (and other dangerous elements) in electronics. But, surprisingly, it was only relatively recently (2008) with the passage of Consumer Product Safety Improvement Act that we finally set limits on the amount of lead in consumer items, particularly kids toys. As of June 2010, the legal limit for lead in kid's toys is 300 ppm (parts per million) and will drop to 100 ppm by August 2011. The legal limit for cadmium in paint on kid's toys is 75 ppm.

Our house is filled with all sorts of toys, many of which we've picked up as hand-me-downs from various sources. I've long wondered whether these toys have lead, cadmium, etc...

So, I decided to test them myself, using an XRF analyzer. This amazing handheld device emits a small directed xray beam out the front, causing elements to fluoresce with specific spectral signatures. The device detects the strength of these signatures and reports back to you the breakdown of elements in the sample, either in parts per million (ppm) or in percentage (1% = 10,000 ppm!).

In addition to lead, the device reliably detects other elements like cadmium, bromine (used in brominated flame retardants), chlorine, arsenic, mercury, tin, antimony, chromium, and many others!

There are some limitations. For example, all of the door knobs in my house tested high (1-2%) for lead; however, it turns out they are nickel plated and the XRF analyzer was seeing through this layer to the lead beneath. Likely this lead would never transfer to our hands, unless the nickel wears through.

Another limitation is that the device detects elements, not their molecular form. For example, certain forms of chromium, like hexavalent chromium, are toxic, while other forms, like trivalent chromium, are useful and necessary in the human body.

Finally, just because a material contains a given element doesn't mean that element would ever find its way into a child's body. For example, lead bound in plastic is likely difficult to dislodge.

For all these reasons, just because the analyzer detects certain elements in a given item, does not mean the item could cause harm. Think of the analyzer as a simple fact-finder: it reports the raw elements it sees. What action you then choose to take is your decision. However, the precautionary principle applies here: if an item does have measurable amounts of lead or cadmium, why risk exposing your family to it?

While the devices are still insanely expensive, they have come down in price enough to make rental just barely within reach for the end consumer. I rented mine (a Niton XL3T) for a few hundred dollars a day from Ashtead Technology, and split the cost with a neighbor (she tested by night and I tested by day!).

So I walked around my house, testing everything, and the results were a real eye-opener! A number of ordinary looking things had lead, sometimes at levels far higher than the 300 ppm legal limit, plus other possibly harmful elements:



The green rain coat has 6,537 ppm lead; the handle on the lacrosse stick has 14,700 ppm lead; the basketball has 3,320 ppm lead and 322 ppm arsenic; the lunchbox has 677 ppm lead; the doctor's bag has 2,511 ppm antimony and 55 ppm arsenic. Many smaller toys also have lead:



The red measuring spoon has 1,651 ppm lead; the blue dinosaur has 767 ppm lead and 91 ppm cadmium; the red car has 860 ppm lead; the red plate has 3,268 ppm lead; the blue train has 271 ppm lead; the little green ABC has 1,015 ppm lead. The three legos were surprising, having between 1,245 and 2,427 ppm lead -- they are not actually legos; they are a copycat brand (Mega Bloks). All of our "real" legos tested fine.

Other toys have high cadmium:



The little slinky has 527 ppm cadmium and 1,365 ppm lead; the salt shaker has 22,634 ppm cadmium; the ice cream scoop has 1,188 ppm cadmium. Here were a bunch of other small toys that had varying amounts of lead:



Small toys are especially spooky since babies are more likely to put them in their mouth. These pictures show only about 1/3rd of the things I found with lead! Here are some other surprising discoveries:

  • Old Christmas ornaments often have very high lead. I had one silver ball that had ~1% arsenic and ~20% lead. Worse, it was well worn, which means the arsenic and lead had rubbed off onto people's hands over the years.

  • Car keys have lead (~6,700 ppm for my 2 keys)! Don't let your babies play with them.

  • Nearly every wire has high lead content in the insulation; Christmas lights had especially high lead.

  • The soft (usually black) plastic/rubber kids' bike handles, and also tricycles, are often very high in lead.

  • An old recliner had 29,500 ppm lead on the foot rest and 20,000 ppm lead on the arm rests.

  • Our door knobs, cabinet knobs, faucets, and spouts, had ~1-2% lead.

  • We had a jar of kids' vitamins. The vitamins tested OK, but the jar (it had a dark brown tint) had 140 ppm lead.

  • One of our garden hoses had 5,255 ppm lead; not great because we had used this hose to water the plants in our edible garden.

  • Our waffle iron had 1,143 ppm lead on the cooking surface (likely, though, this was the metal under the teflon).

  • My supposedly lead-free solder (was not cheap!!) in fact contains 0.5% lead.


Here are some "rough" patterns:

  • The older the toy, the more likely it is to have high lead, cadmium, etc.

  • Newer things that are "burnable" (fleece, bedding, futons, mattresses, beds, plush chairs, etc.) often have very high (10s of thousands ppm) levels of bromine.

  • Colors like yellow, orange, red often have lead in their pigment.

  • Beware glasses that have paint on them! We had one glass that was 8% lead -- 266 times the legal limit. Colored glass (where the color is embedded into the glass) is also more likely to be leaded.


I'm not the only consumer playing with XRF analyzers: a recent recall of 12M Shrek glasses by McDonald's was initiated by at least two people who discovered high levels of cadmium in the paint on the glasses using XRF analyzers.

Saturday, June 5, 2010

Lucene's PulsingCodec on "Primary Key" Fields

Update, August 2014: the pulsing approach described here works well and has now been incorporated into Lucene's default postings format, so there's really no need to use PulsingPostingsFormat yourself unless you are using a custom postings format that doesn't do its own pulsing.

Flexible indexing in Lucene (now available on trunk, which will eventually be the next major release, 4.0) enables apps to use custom codecs to write/read the postings (fields, terms, docs, positions, payloads).

By default, Lucene uses the StandardCodec, which writes and reads in nearly the same format as the current stable branch (3.x). Details for a given term are stored in terms dictionary files, while the docs and positions where that term occurs are stored in separate files.

But there is an experimental codec, PulsingCodec, which implements the pulsing optimization described in a paper by Doug Cutting and Jan Pedersen. The idea is to inline the docs/positions/payloads data into the terms dictionary for low frequency terms, so that you save 1 disk seek when retrieving document(s) for that term.

The PulsingCodec wraps another fallback Codec that you provide; this allows the pulsing to be dynamic, per term. For each term, if its frequency (the number of documents it appears in) is at or below a threshold (default 1) that you provide, then that term's postings are inlined into the terms dictionary; otherwise, the term is forwarded (pulsed) to the wrapped codec. This means PulsingCodec should be helpful for ordinary text fields, which obey Zipf's law, since many terms will be rare-ish.
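To make the per-term decision concrete, here's a toy sketch of the idea (the classes are made up for illustration, not the real codec API):

    import java.util.Arrays;

    // Toy sketch of pulsing: terms whose docFreq does not exceed the cutoff
    // are handled inline; everything else is delegated ("pulsed") to the
    // wrapped postings writer.
    class PulsingSketch {
      interface PostingsWriter {
        void write(String term, int[] docIDs);
      }

      private final int freqCutoff;          // e.g. 1 for a primary-key field
      private final PostingsWriter wrapped;  // the fallback (standard) writer

      PulsingSketch(int freqCutoff, PostingsWriter wrapped) {
        this.freqCutoff = freqCutoff;
        this.wrapped = wrapped;
      }

      void writeTerm(String term, int[] docIDs) {
        if (docIDs.length <= freqCutoff) {
          // Rare term: the real codec writes these postings inline into the
          // terms dict entry (here we just print the decision), so a later
          // lookup needs no extra seek into the postings file.
          System.out.println(term + " -> inline " + Arrays.toString(docIDs));
        } else {
          // Common term: delegate to the wrapped writer, which stores the
          // postings in its own files; the terms dict keeps only a pointer.
          wrapped.write(term, docIDs);
        }
      }
    }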

PulsingCodec should really shine on "primary key" fields, where each term occurs in exactly one document, and batch lookups (for example because the app performs deletes, updates and/or lookups) are common.

I created a simple performance test to confirm this.

The test first creates an optimized index with 10M docs, where each doc has a single field with a randomly generated unique term, and then performs term -> doc lookup for N (parameter) random terms. It's a self-contained test (source code is here).
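The hot loop of the test is essentially one term lookup per key, something like this sketch (not the actual test source; the field name "id" and the era's IndexSearcher API are assumptions):

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    class PrimaryKeyLookup {
      // Returns the Lucene docID for the given key, or -1 if it is not in the
      // index. With the standard codec this costs a seek into the terms dict
      // plus a seek into the postings file; with pulsing the second seek goes
      // away because the posting is inlined into the terms dict.
      static int lookup(IndexSearcher searcher, String key) throws IOException {
        TopDocs hits = searcher.search(new TermQuery(new Term("id", key)), 1);
        return hits.totalHits == 0 ? -1 : hits.scoreDocs[0].doc;
      }
    }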

It's important to flush your OS's IO cache before running the test; otherwise you can't measure the reduced number of seeks. On recent Linux kernels, just run echo 1 > /proc/sys/vm/drop_caches. That said, in real production usage the IO cache will typically (legitimately) help you, and pulsing should make more efficient use of the IO cache since the postings data is stored contiguously.

To measure the speedup from using PulsingCodec on a primary key field, as well as the impact of the OS's IO cache, I ran the above test for an increasing number of random term lookups (always flushing the OS's IO cache first):



The results are compelling! When performing a small number of term lookups relative to the total number of terms on a cold OS IO cache, which is likely the more common case in a real application, pulsing shows a ~45-50% speedup, as expected, since it requires 1/2 the seeks.

As the number of random term lookups increases, PulsingCodec's gains decrease, because more and more of the lookups are hitting the OS's IO cache and thus avoiding the seek (the machine I ran the test on had plenty of RAM to cache the entire index). It's interesting that PulsingCodec still shows ~15% gain once the lookups are mostly cached; likely this is because PulsingCodec saves the deref cost of finding the postings in the frq file.

Pulsing also makes the index a bit smaller (211 MB vs 231 MB), because it saves one vLong pointer per term. For the test, the index with pulsing had a 0 byte frq file since all postings were inlined into the terms dict. There is no prx file because I index the field with setOmitTermFreqAndPositions(true).

Note that the test case simply uses PulsingCodec for all fields; if you'd like per-field control you should use the PerFieldCodecWrapper. However, because PulsingCodec is dynamic (per term), it is likely a good default for all fields.

Another way to speed up primary key lookups through Lucene is to store your index on a solid-state disk, where seeks are much less costly than they are on spinning magnets (though, still several orders of magnitude more costly than RAM). Or better yet, do both!

Friday, May 14, 2010

Cancer and chemicals

The President's Cancer Panel yesterday released its 2008-2009 annual report. It's a long (240 page) report, and what's amazing is that these are mainstream scientists who are now calling for precautionary reduction of our exposure to all sorts of chemicals we are routinely exposed to.

It has some sobering facts about cancer, such as:

Approximately 41% of Americans will be diagnosed with cancer at some point in their lives, and 21% will die from it.

And:

The incidence of some cancers, including some most common among children, is increasing for unexplained reasons.

This op-ed pulls out some great highlights from the report. For example, here's a depressing one:

Noting that 300 contaminants have been detected in umbilical cord blood of newborn babies, the study warns that: "to a disturbing extent, babies are born 'pre-polluted.' "

And then there's this quote:

The report blames weak laws, lax enforcement and fragmented authority, as well as the existing regulatory presumption that chemicals are safe unless strong evidence emerges to the contrary.

Congress is now attempting to address this, with the Safe Chemicals Act.

Here is a quote specifically about Bisphenol A:

Studies of BPA have raised alarm bells for decades, and the evidence is still complex and open to debate. That's life: In the real world, regulatory decisions usually must be made with ambiguous and conflicting data. The panel's point is that we should be prudent in such situations, rather than recklessly approving chemicals of uncertain effect.

This is an important point: during the time of uncertainty, when we don't know that a given chemical is dangerous nor do we know that it is safe, we should err on the side of caution, treating the chemical as guilty until proven innocent. I discovered that there is a name for this approach: the precautionary principle, and this is of course the core change to the Safe Chemicals Act.

Here's yet another quote, this time from a brief article about one of the scientists who discovered lead exposure, even in tiny amounts, is very dangerous for kids:

"We've been very careless in simply presuming that chemicals are innocent until proven guilty," says Dr. Phillip Landrigan.

Thursday, May 13, 2010

Plants and animals use quantum mechanics?

Quantum mechanics, whose entanglement Einstein famously dismissed as "spooky action at a distance", is a set of laws that govern how tiny (atomic & sub-atomic) things interact with one another. The laws are very different from classical physics, and really quite surprising, but they nevertheless appear to be true (there's been much experimental validation).

Using quantum mechanics, it's possible to build a quantum computer, and indeed many research labs, and at least one startup, have built simple ones. Quantum computers can do some amazing things, such as factoring integers very quickly, something classical computers can only do very slowly as the numbers get bigger (as best we know, so far).

Most recently, a simplistic quantum computer was used to compute a discrete Fourier transform using a single iodine molecule. Someday quantum computers will be all over the place...

But isn't it possible that plants and animals have evolved to take advantage of quantum mechanics? Indeed, there is evidence that they have!

The process of photosynthesis looks to be based on quantum entanglement.

And, one leading theory about how animals can smell so well is based on quantum vibrations. Luca Turin, who created this theory, has a good quote in that article:

Most people would probably feel that if it can be done at all, evolution has managed to make use of it.

And this makes sense - evolution is relentless at trying to find good ways to create plants and animals. Since quantum mechanics is real, evolution should have tapped into it.

This is also the reason I would expect Lamarckian inheritance to in fact be true. Any animal that can alter the traits of its offspring based on experiences in its own lifetime would clearly have a big advantage, so, evolution really should have found a way.

In fact, it sort of did, in a non-biological manner: language. We can pass on all sorts of life lessons to our kids, through language, and we get to stand on the shoulders of past giants, as we take for granted the knowledge created by the generations before us, passed on through language.

Monday, May 3, 2010

Lots of problems these days...

I try to keep my kids roughly informed about what's going on in the world; I think it's important they grow up with at least a basic world view.

Last night, my 7 year old son observed "you know there are a lot of problems right now", and he's right! He then rattled off the Iceland volcano eruption, the ruptured water main in our state, forcing us to boil water before drinking it, and the disastrous oil rig explosion and subsequent, ongoing oil geyser.

Concord bans sale of bottled water

Fabulous! The town of Concord, MA has banned the sale of bottled water because of the wretched environmental impact these bottles have.

The environmental cost of such trash is stunning. We know this fills up our landfills, but have you heard of the Great Pacific Garbage Patch? This is one of 5 spots in the oceans where garbage collects and kills marine life.

Of course you really shouldn't drink bottled water in the first place.

Sunday, May 2, 2010

Your ideas and your name

I love this quote:

Your ideas will go further if you don't insist on going with them.

It's very true (note that I didn't tell you who said it)!

I find it especially applies to healthy open source projects. In Apache, the individuals (contributors, committers) who work on a given project are fleeting, transient. We will come and go. Our names are not attached to the code we commit.

Sinister search engine de-optimization

There's a large recall going on right now for many popular over-the-counter infant's and children's medicines, such as Motrin, Tylenol, and Zyrtec.

But if you look at the recall details page, posted by the company that manufactures these medicines (McNeil), you'll see that the table is actually a single JPEG image instead of an HTML table. If you don't believe me, try searching in your browser for the words you see in that table!

At first I thought "how strange -- why would they use an image instead of a normal HTML table?". But then a more sinister plot came to mind: perhaps they want to make it as hard as possible for future web searches to find this page. After all, they must now be in major damage control mode. It's the exact opposite problem of the more common search engine optimization.

Hiding text in a JPEG image to avoid searches finding you is a rather nasty practice, in my opinion (hmm, I see there's even this service to help you do it!). Google could prevent such sneakiness by running OCR on the image (perhaps they do this already -- anyone know?), but then I suppose the war would escalate and we'd start seeing barely readable tables like this that look like the dreaded Captcha tests. To work around such companies, if we all link to this page with some real text (as I've done above) then Google will still find it!

It's also possible there's a more reasonable explanation for this: maybe the good old saying "never attribute to malice that which can be adequately explained by something else" somehow applies?

Sunday, April 18, 2010

Safe Chemicals Act

Finally, there's been a bill introduced in Congress, called the Safe Chemicals Act, to better regulate the chemicals we are all exposed to in day-to-day products.

The current law (from 1976) is ancient, and assumes any chemical is safe until proven otherwise -- innocent until proven guilty -- a dangerously lax approach which has resulted in the potentially dangerous chemicals we have all heard about, such as bisphenol A in certain plastics, brominated flame retardants, numerous phthalates released by vinyl (this causes that awful new-shower-curtain smell), etc.

There are 80,000 chemicals in use today and the EPA has only required testing for 200 of them. There are surely additional dangerous chemicals lurking, undetected.

Refreshingly, the new bill takes the opposite approach, the same approach I take with my family: a chemical is guilty until proven innocent! This is why I use no pesticides on my life support grass.

If this bill becomes law, manufacturers must prove a chemical is safe before they can use it in their products. The burden of proof moves to the manufacturers. Here's a good writeup of the bill.

Lobbyists will undoubtedly fight this bill tooth and nail... so we need to push our congressional representatives!! You can use this site to do so.

Monday, March 8, 2010

Disagreements are healthy!

Another great quote, this time from Henry Ford:

If two people always agree, one of them is unnecessary.

The quote applies very well to open source development -- what makes open source so strong is that such wildly diverse people, with different backgrounds, ideas, approaches, languages, IDEs, interests, itches, sleeping habits, coffee addictions, etc., come together and work on the same problems.

And lots of disagreement ensues.

As long as the resulting discussions are driven by technical merit and not personal attacks (ie, The Apache Way), and consensus is eventually reached, then the disagreements are very powerful.

Thursday, February 25, 2010

Serenity, courage and wisdom

I love this quote:

Strive for the serenity to accept the things you cannot change;
courage to change the things you can; and wisdom to know the
difference.

It's the opening of the Serenity Prayer (commonly attributed to the theologian Reinhold Niebuhr), but I've taken the liberty of replacing "God grant" with "strive for".

The idiom "pick your battles" means the same thing as the 3rd goal.

This great quote from George Bernard Shaw relates closely to the 2nd goal:

The reasonable man adapts himself to the conditions that surround him...
The unreasonable man adapts surrounding conditions to himself...
Therefore, all progress depends on the unreasonable man

Think about these goals. You'll likely find that each goal is distinct, very important, and somewhat surprisingly often applies in your day to day life.

Thursday, February 4, 2010

Why do people vote against their own interests?

I found this article very interesting.

The first thing that struck me is its detached tone -- sort of like the curious look a child gives when looking inside a cage at an exotic animal. Probably it was written by someone who lives in Great Britain, looking across the Atlantic Ocean with mild curiosity at how crazy the US health care situation is.

The second thing that struck me is that the observation is very true. A huge majority of the population in this country would benefit from health care reform, including the public option. Those who cannot afford health insurance now, those with pre-existing conditions, etc.

Yet, many people who stand to benefit angrily fight reform. It's just plain weird. Here are two quotes from the article:
  • Why are so many American voters enraged by attempts to change a horribly inefficient system that leaves them with premiums they often cannot afford?

  • In Texas, where barely two-thirds of the population have full health insurance and over a fifth of all children have no cover at all, opposition to the legislation is currently running at 87%.

Finally, the two books referenced by the article certainly look relevant, roughly concluding that the average American doesn't really make decisions based on facts. (This also means "trial by a jury of your peers" is not a very comforting approach.) While I haven't read these books, I have come to the same depressing conclusion.

A democracy is only as effective as its population is at making rational decisions.