Wednesday, November 3, 2010

Big Fat Fiasco

I just watched this talk by Tom Naughton, describing the mis-steps and bad science over the years that have tricked us all into believing we should avoid fat and cholesterol in our diet when in fact we should be doing the exact opposite!

Tom is an excellent speaker (he's a comedian!), mixing in humor in what is at heart a very sad topic. He details the science that should have led us to conclude that excessive carbs and sugar consumption are in fact the root cause behind heart disease, diabetes and obesity, not fat and cholesterol as we've been incessantly told over the years.

He shows how the "Fat Lazy Slob Theory" (also called "calories in minus calories out") so frequently put forth to explain weight gain is in fact wrong, and that instead the root cause is the biochemistry in your body driving you to eat when your blood sugar is too high.

Tom's documentary movie Fat Head, which is getting great reviews on Amazon, delves into these same topics. I haven't watched it yet but I plan to.

So enjoy your butter, cheese, whole milk, eggs (with yolk!!) and cut back on sugars and carbs. And don't drink soda!

Monday, October 25, 2010

Our medical system is a house of cards

I just came across this great article about meta-researcher Dr. John Ioannidis. Here's the summary:

Much of what medical researchers conclude in their studies is misleading, exaggerated, or flat-out wrong. So why are doctors—to a striking extent—still drawing upon misinformation in their everyday practice? Dr. John Ioannidis has spent his career challenging his peers by exposing their bad science.

The gist is that modern medical research is deeply flawed and biased such that the "conclusions" that you and I eventually read in the news headlines are often false. I especially love his advice for us all:

Ioannidis suggests a simple approach: ignore them all

This is in fact my approach! I have a simple rule: if it tastes good it's good for you. So I eat plenty of fat, salt, sugar, cholesterol, carbs, etc. I love eggs and cheese and I always avoid low-fat or low-cholesterol foods. I get lots of sun and never use sun screen. I drink coffee and beer, daily. I drink lots of water. I get daily exercise, running and walking. And I avoid hand sanitizers like Purell (I believe commonplace dirt/germs are in fact natural and good for you). I strongly believe humans do not need pills to stay healthy. I don't take a daily vitamin. And I'm very healthy!

This short interview between Discover Magazine and Harvard clinician John Abramson echoes the same core problem. Here's a choice quote:

When you look at the highest quality medical studies, the odds that a study will favor the use of a new drug are 5.3 times higher for commercially funded studies than for noncommercially funded studies.

Unfortunately, the medical world has a deep, deep conflict of interest: healthy people do not generate profits. Capitalism is a horrible match to health care.

So, next time your doctor prescribes a fancy new cool-sounding powerful drug like Alevia or Omosia or Nanotomopia or whatever, try to remember that our medical system is really built on a house of cards. Your doctor, let alone you, cannot possibly differentiate what's true from what's false. Don't trust that large triple-blind random controlled trial that supposedly validated this cool new drug. You are the guinea pig! And it's only when these drugs cause all sorts of problems once they are really tested on the population at large that their true colors are revealed.

Sunday, October 17, 2010

Pics from BBQ after Lucene Revolution

I finally pulled the pics off my camera from last week's BBQ after Lucene Revolution in Boston, where much fun was had! See them here. It was awesome to finally meet everyone!

Saturday, October 9, 2010

Fun with flexible indexing

The Lucene Revolution conference just wrapped up yesterday. It was well attended (~300 or so people). It was great fun to hear about all the diverse ways that Lucene and Solr are being used in the real world.

I gave a talk about flexible indexing, coming in the next major release of Lucene (4.0). Slides are here.

Tuesday, October 5, 2010

Lucene's SimpleText codec

Inspired by this question on the Lucene user's list, I created a new codec in Lucene called the SimpleText codec. The best ideas come from the user's lists!

This is of course only available in Lucene's current trunk, to be eventually released as the next major release (4.0). Flexible indexing makes is easy to swap in different codecs to do the actual writing and reading of postings data to/from the index, and we have several fun codecs already available and more on the way...

Unlike all other codecs, which save the postings data in compact binary files, this codec writes all postings to a single human-readable text file, like this:

field contents
term file
doc 0
pos 5
term is
doc 0
pos 1
term second
doc 0
pos 3
term test
doc 0
pos 4
term the
doc 0
pos 2
term this
doc 0
pos 0
END


The codec is read/write, and fully functional. All of Lucene's unit tests pass (slowly) with this codec (which, by the way, is an awesome way to test your own codecs).

Note that the performance of SimpleText is quite poor, as expected! For example, there is no terms index for fast seeking to a specific term, no skipping data for fast seeking within a posting list, some operations require linear scanning, etc. So don't use this one in production!

But it should be useful for transparency, debugging, learning, teaching or anyone who is simply just curious about what exactly Lucene stores in its inverted index.

Tuesday, September 28, 2010

Track your home's live electricity usage with Python

Modern electric meters (the one attached to the outside of your house) emit an IR flash for each watt-hour consumed, through the port on the top of the meter. It turns out, it's easy to detect this flash, decode it into "live" power usage, and make cool charts like this:

That's live KW on the Y axis and time on the X axis.

The flash, at least for my meter, appears to have high temporal accuracy, meaning it flashes precisely as you cross 1.000 WH. This is great, since it enables accurate, real-time power usage. For example, when I turn on the lights in my home office, I quickly see the power jump by ~65 watts, and then drop again when I turn the lights off.

This is a fun way to track down which appliances or crazy computers or parasitic drains are using up so much electricity in your house!

I've gone through several designs for this over the years. My last attempt was destroyed when lightning hit my house (there's always a silver lining!), so, I decided to try something new this time and I'm very happy with the approach since it's much simpler, and, moves the pulse detection into Python.

I use a trivial analog circuit to detect the IR pulse, consisting of two 100K resistors in series and an IR photodiode in parallel with one of the resistors. Make sure you get the polarity right on the photodiode otherwise it won't work! Mount the photodiode on your power meter, aligned so that it "sees" each IR pulse. I also added a small (0.01 uF) ceramic capacitor in parallel with the same resistor, to suppress transient EM fields otherwise picked up by the relatively long wires out to my meter.

Finally, I use this simple USB audio adapter to move the problem into the digital domain, connecting to the ends of the two series resistors to the mic input to use it for the A/D conversion. This USB audio adapter drives the mic input with ~4.0V bias voltage, which is great since otherwise you'd need an external source.

When there's no pulse, the photo-diode acts roughly like an open switch, meaning there is a fixed 200K resistive load on the mic input. When a pulse happens (mine seem to last for ~10 msec), the photo-diode acts roughly like a closed switch, suddenly dropping the series resistance to 100K. This drop causes a very detectible pulse (strongly negative then strongly positive) on the mic input's voltage, I suspect because there's a capacitor behind the bias voltage (but: I am no analog circuit engineer, so this is speculation!).

You can plug this USB audio adapter into any computer; I use a Sheeva plug computer (delightful device, very low power -- I have three!). I record the digital samples (arecord works well, at a 2 Khz rate) and decode the samples in Python to detect a pulse whenever the value drops below -1000. You can easily compute the live KW based on the time between two adjacent pulses, push this into a database, and build graphs on top of this using Google's visualization APIs (I use dygraphs).

My last approach didn't have nearly the temporal accuracy (ie, it smoothed heavily across time), which masks interesting things. For example, now I can tell the difference between a resistive load (the coffee maker, oven, crockpot) and an inductive load (refrigerator compressor, vacuum cleaner) because the inductive load has a huge spike in the beginning, as the motor consumes lots of power trying to spin up.

Friday, September 17, 2010

Lucene's indexing is fast!


Wikipedia periodically exports all of the content on their site, providing a nice corpus for performance testing. I downloaded their most recent English XML export: it uncompresses to a healthy 21 GB of plain text! Then I fully indexed this with Lucene's current trunk (to be 4.0): it took 13 minutes and 9 seconds, or 95.8 GB/hour -- not bad!

Here are the details: I first pre-process the XML file into a single-line file, whereby each doc's title, date, and body are written to a single line, and then index from this file, so that I measure "pure" indexing cost. Note that a real app would likely have a higher document creation cost here, perhaps having to pull documents from a remote database or from separate files, run filters to extract text from PDFs or MS Office docs, etc. I use Lucene's contrib/benchmark package to do the indexing; here's the alg I used:

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

content.source = org.apache.lucene.benchmark.byTask.feeds.LineDocSource
docs.file = /lucene/enwiki-20100904-pages-articles.txt

doc.stored = true
doc.term.vector = false
doc.tokenized = false
doc.body.stored = false
doc.body.tokenized = true

log.step.AddDoc=10000

directory=FSDirectory
compound=false
ram.flush.mb = 256

work.dir=/lucene/indices/enwiki

content.source.forever = false

CreateIndex

{ "BuildIndex"
[ { "AddDocs" AddDoc > : * ] : 6
- CloseIndex
}

RepSumByPrefRound BuildIndex

There is no field truncation taking place, since this is now disabled by default -- every token in every Wikipedia article is being indexed. I tokenize the body field, and don't store it, and don't tokenize the title and date fields, but do store them. I use StandardAnalyzer, and I include the time to close the index, which means IndexWriter waits for any running background merges to complete. The index only has 4 fields -- title, date, body, and docid.

I've done a few things to speed up the indexing:
  • Increase IndexWriter's RAM buffer from the default 16 MB to 256 MB
  • Run with 6 threads
  • Disable compound file format
  • Reuse document/field instances (contrib/benchmark does this by default)
Lucene's wiki describes additional steps you can take to speed up indexing.

Both the source lines file and index are on an Intel X25-M SSD, and I'm running it on a modern machine, with dual Xeon X5680s, overclocked to 4.0 Ghz, with 12 GB RAM, running Fedora Linux. Java is 64bit 1.6.0_21-b06, and I run as java -server -Xmx2g -Xms2g. I could certainly give it more RAM, but it's not really needed. The resulting index is 6.9 GB.

Out of curiosity, I made a small change to contrib/benchmark, to print the ingest rate over time. It looks like this (over a 100-second window):



Note that a large part (slightly over half!) of the time, the ingest rate is 0; this is not good! This happens because the flushing process, which writes a new segment when the RAM buffer is full, is single-threaded, and, blocks all indexing while it's running. This is a known issue, and is actively being addressed under LUCENE-2324.

Flushing is CPU intensive -- the decode and reencode of the great many vInts is costly. Computers usually have big write caches these days, so the IO shouldn't be a bottleneck. With LUCENE-2324, each indexing thread state will flush its own segment, privately, which will allow us to make full use of CPU concurrency, IO concurrency as well as concurrency across CPUs and the IO system. Once this is fixed, Lucene should be able to make full use of the hardware, ie fully saturate either concurrent CPU or concurrent IO such that whichever is the bottleneck in your context gates your ingest rate. Then maybe we can double this already fast ingest rate!

Monday, September 13, 2010

Fast search filters using flex

A filter in Lucene is a bit set that restricts the search space for any query; you pass it into IndexSearcher's search method. It's effective for a number of use cases, such as document security, index partitions, facet drill-down, etc.

To apply a filter, Lucene must compute the intersection of the documents matching the query against the documents allowed by the filter. Today, we do that in IndexSearcher like this:

while (true) {
if (scorerDoc == filterDoc) {
// Check if scorer has exhausted, only before collecting.
if (scorerDoc == DocIdSetIterator.NO_MORE_DOCS) {
break;
}
collector.collect(scorerDoc);
filterDoc = filterIter.nextDoc();
scorerDoc = scorer.advance(filterDoc);
} else if (scorerDoc > filterDoc) {
filterDoc = filterIter.advance(scorerDoc);
} else {
scorerDoc = scorer.advance(filterDoc);
}
}

We call this the "leapfrog approach": the query and the filter take turns trying to advance to each other's next matching document, often jumping past the target document. When both land on the same document, it's collected.

Unfortunately, for various reasons this implementation is inefficient (these are spelled out more in LUCENE-1536):
  1. The advance method for most queries is costly.
  2. The advance method for most filters is usually cheap.
  3. If the number of documents matching the query is far higher than
    the number matching the filter, or vice versa, it's better to drive
    the matching by whichever is more restrictive.
  4. If the filter supports fast random access, and is not super
    sparse, it's better to apply it during postings enumeration, like
    deleted docs.
  5. Query scorers don't have a random access API, only .advance(),
    which does unecessary extra work .next()'ing to the next matching
    document.
To fix this correctly, Lucene really needs an optimization stage, much like a modern database, which looks at the type and structure of the query and filter, as well as estimates of how many documents will match, and then picks an appropriate execution path. Likely this will be coupled with source code specilization/generation (LUCENE-1594) to write dedicated java code to execute the chosen path. Some day we'll get to that point!

Until then, there in a simple way to get a large speedup in many cases, addressing the 4th issue above. Prior to flexible indexing, when you obtained the postings enumeration for documents matching a given term, Lucene would silently filter out deleted documents. With flexible indexing, the API now allows you to pass in a bit set marking the documents to skip. Normally you'd pass in the IndexReader's deleted docs. But, with a simple subclass of FilterIndexReader, it's possible to use any filter as the documents to skip.

To test this, I created a simple class, CachedFilterIndexReader (I'll attach it to LUCENE-1536). You pass it an existing IndexReader, plus a Filter, and it creates an IndexReader that filters out both deleted documents and documents that don't match the provided filter. Basically, it compiles the IndexReader's deleted docs (if any), and the negation of the incoming filter, into a single cached bit set, and then passes that bit set as the skipDocs whenever postings are requested. You can then create an IndexSearcher from this reader, and all searches against it will be filtered according to the filter you provided.

This is just a prototype, and has certain limitations, eg it doesn't implement reopen, it's slow to build up its cached filter, etc.

Still, it works very well! I tested it on a 10M Wikipedia index, with a random filter accepting 50% of the documents:

QueryQPS DefaultQPS Flex% change
united~0.719.9519.25-3.5%
un*d43.1952.2120.9%
unit*21.5330.5241.8%
"united states"6.128.7442.9%
+united +states9.6814.2347.0%
united states7.7114.5688.9%
states15.7336.05129.2%

I'm not sure why the fuzzy query got a bit slower, but the speedups on the other queries are awesome. However, this approach is actually slower if the filter is very sparse. To test this, I ran just the TermQuery ("states"), against different filter densities:

The cutover, for TermQuery at least, is somewhere around 1.1%, meaning if the filter accepts more than 1.1% of the index, it's best to use the CachedFilterIndexReader class; otherwise it's best to use Lucene's current implementation.

Thanks to this new flex API, until we can fix Lucene to properly optimize for filter and query intersection, this class gives you a viable, fully external, means of massive speedups for non-sparse filters!

Tuesday, August 24, 2010

Here comes another bubble

Hilarious. It's a spoof of Billy Joel's catchy We Didn't Start the Fire. Scary that this was posted more than two and a half years ago!

Saturday, August 7, 2010

Do babies really sleep?

Those of you with kids have undoubtedly discovered that sometimes they sleep with their eyes half open.

The first time it happens it's somewhat alarming, but, then, you get used to it and realize it's "normal". It probably just means their eyelids haven't quite closed yet.

Well, my one-year-old did this, yesterday, so I seized the opportunity to do an experiment!! I wanted to test just how asleep she was, so, I smiled like crazy, made funny faces, scrunched my nose, stuck my tongue out, at her half open eye.

And get this: she laughed! In her sleep. I repeated the test 3 times and each time she laughed. Yet, she was most definitely "asleep" -- the eyelid eventually closed and she didn't wake up for another hour.

I guess enough of the brain remains active to process what the eye is seeing even when the rest of the brain is sleeping? Wild!

Before you think I'm being unfair, she also conducts sleep experiments on us! Last weekend the whole family went camping. It was bed time and, as usual, we were all exhausted except our one-year-old. There was enough low light so she could see, so, as we were all dropping off one by one, she took turns running back and forth and pulling hard on our chins to wake us up again! Probably, from her perspective, this was like a horror movie, where her family is one by one dropping into comas and it's her job to prevent the disaster!

Kids are great fun!