Saturday, January 8, 2011

Finite State Transducers, Part 2

In my last post, I described a cool incremental algorithm for building an FST from pre-sorted input/output pairs, and how we're folding it into Lucene.

Progress! The patch on LUCENE-2792 is now committed to Lucene's trunk (eventually 4.0), so the SimpleText codec now uses an FST to hold all terms in RAM.

This was a great step forward, and it added the raw low-level infrastructure for building and using FSTs. But, it's really just a toy usage, since the SimpleText codec is not for production use.

I'm happy to report even more progress: we finally have a "real" usage for FSTs in Lucene! With LUCENE-2843 (now committed), we use an FST to hold the terms index in RAM.

Many operations in Lucene require looking up metadata for a requested term. This information includes the number of documents containing the term, file pointers into postings files that actually store the docIDs and positions, etc. Because there could be many unique terms, all this term metadata resides on-disk in the terms dictionary file (_X.tis).

Looking up a term is then a two-step process. First, we consult the terms index, which resides entirely in RAM, to map the requested term to a file pointer in the terms dictionary file. Second, we scan the terms in the on-disk terms dictionary file starting from that file pointer, to look for an exact match.
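
To make that concrete, here's a rough sketch of the two-step lookup. The class and method names here are purely illustrative, not Lucene's actual internal API:

// Purely illustrative sketch of the two-step term lookup; these names are
// hypothetical, not Lucene's real internals.
long seekExact(BytesRef term) throws IOException {
  // Step 1: the in-RAM terms index maps the requested term to the file
  // pointer of the nearest indexed term that is <= the requested term.
  long filePointer = termsIndex.seekFloor(term);

  // Step 2: scan the on-disk terms dictionary starting from that file
  // pointer, looking for an exact match.
  termsDictInput.seek(filePointer);
  while (termsDictInput.readNextTerm(scratchTerm)) {
    int cmp = scratchTerm.compareTo(term);
    if (cmp == 0) {
      return termsDictInput.termMetadataPointer();   // exact match: return its metadata
    } else if (cmp > 0) {
      return -1;                                     // scanned past it: no such term
    }
  }
  return -1;
}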

Certain queries, such as TermQuery, need only one exact term lookup. Others, such as FuzzyQuery or the new, very cool automaton spellchecker, which does not require a separate spellchecker index, perform many term lookups (potentially millions).

Before LUCENE-2843, the terms index in trunk used packed byte[] and int[] arrays. That was already a huge improvement over the terms index in 3.0, but it's still more wasteful than an FST, since each term was stored in fully expanded form while the FST shares common prefixes and suffixes. Getting this working also required adding seekFloor and seekCeil methods to the FSTEnum classes (analogous to the floor/ceiling lookups on a sorted map).

To test this, I indexed the first 10 million 1KB documents derived from Wikipedia's English database download. The resulting FST-based terms index required ~38%-52% less RAM than the packed arrays (larger segments see bigger gains, as the FST "scales up" well). Not only is the RAM required much lower, but term lookups are also faster: the FuzzyQuery united~2 was ~22% faster.

This is very much win/win, so we've made this terms index the default for all core codecs (the terms dictionary and terms index are pluggable, so it's easy for other codecs to use this as well).

There are many other ways we can use FSTs in Lucene, and we've only just scratched the surface here. In fact, FSTs offer such a sizable RAM reduction that I think for many, but not all, apps it'd be realistic to avoid the two-step term lookup process entirely and simply hold the entire terms dictionary in RAM. This should make term-lookup-intensive queries potentially much faster, though we'd likely have to rework them to use different algorithms optimized for iterating directly through the terms as an FST.

Friday, December 3, 2010

Using Finite State Transducers in Lucene

FSTs are finite-state machines that map a term (byte sequence) to an arbitrary output. They also look cool:



That FST maps the sorted words mop, moth, pop, star, stop and top to their ordinal number (0, 1, 2, ...). As you traverse the arcs, you sum up the outputs, so stop hits 3 on the s and 1 on the o, giving an output ordinal of 4. The outputs can be arbitrary numbers or byte sequences, or combinations, etc. -- it's pluggable.

Essentially, an FST is a SortedMap<ByteSequence,SomeOutput>, if the arcs are in sorted order. With the right representation, it requires far less RAM than other SortedMap implementations, but has a higher CPU cost during lookup. The low memory footprint is vital for Lucene since an index can easily have many millions (sometimes, billions!) of unique terms.
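
Here's a small sketch of what building and querying that FST looks like with the API from LUCENE-2792 (exact class names and method signatures may shift a bit before this lands in 4.0):

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

// Inputs must be pre-sorted; the outputs here are the ordinals 0, 1, 2, ...
String[] words = {"mop", "moth", "pop", "star", "stop", "top"};

PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
IntsRef scratch = new IntsRef();
for (int ord = 0; ord < words.length; ord++) {
  builder.add(Util.toIntsRef(new BytesRef(words[ord]), scratch), (long) ord);
}
FST<Long> fst = builder.finish();

// Lookup sums the arc outputs along the path: stop hits 3 (on s) + 1 (on o) = 4
Long ordinal = Util.get(fst, new BytesRef("stop"));
System.out.println("stop -> " + ordinal);   // prints: stop -> 4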

There's a great deal of theory behind FSTs. They generally support the same operations as FSMs (determinize, minimize, union, intersect, etc.). You can also compose them, where the outputs of one FST are intersected with the inputs of the next, resulting in a new FST.

There are some nice general-purpose FST toolkits (OpenFst looks great) that support all these operations, but for Lucene I decided to implement this neat algorithm which incrementally builds up the minimal unweighted FST from pre-sorted inputs. This is a perfect fit for Lucene since we already store all our terms in sorted (unicode) order.

The resulting implementation (currently a patch on LUCENE-2792) is fast and memory efficient: it builds the FST for the 9.8 million unique terms in a 10 million document Wikipedia index in ~8 seconds (on a fast computer), requiring less than 256 MB heap. The resulting FST is 69 MB. It can also build a prefix trie, pruning by how many terms come through each node, with even less memory.

Note that because addition is commutative, an FST with numeric outputs is not guaranteed to be minimal in my implementation; perhaps if I could generalize the algorithm to a weighted FST instead, which also stores a weight on each arc, that would yield the minimal FST. But I don't expect this will be a problem in practice for Lucene.

In the patch I modified the SimpleText codec, which was loading all terms into a TreeMap mapping the BytesRef term to an int docFreq and long filePointer, to use an FST instead, and all tests pass!

There are lots of other potential places in Lucene where we could use FSTs, since we often need to map the index terms to "something". For example, the terms index maps to a long file position; the field cache maps to ordinals; the terms dictionary maps to codec-specific metadata, etc. We also have multi-term queries (eg Prefix, Wildcard, Fuzzy, Regexp) that need to test a large number of terms and could instead work directly via intersection with the FST (many apps could easily fit their entire terms dict in RAM as an FST since the format is so compact). The FST could be used for a key/value store. Lots of fun things to try!

Many thanks to Dawid Weiss for helping me iterate on this.

Wednesday, November 3, 2010

Big Fat Fiasco

I just watched this talk by Tom Naughton, describing the mis-steps and bad science over the years that have tricked us all into believing we should avoid fat and cholesterol in our diet when in fact we should be doing the exact opposite!

Tom is an excellent speaker (he's a comedian!), mixing humor into what is at heart a very sad topic. He details the science that should have led us to conclude that excessive carb and sugar consumption is in fact the root cause behind heart disease, diabetes and obesity, not fat and cholesterol as we've been incessantly told over the years.

He shows how the "Fat Lazy Slob Theory" (also called "calories in minus calories out") so frequently put forth to explain weight gain is in fact wrong, and that instead the root cause is the biochemistry in your body driving you to eat when your blood sugar is too high.

Tom's documentary movie Fat Head, which is getting great reviews on Amazon, delves into these same topics. I haven't watched it yet but I plan to.

So enjoy your butter, cheese, whole milk, eggs (with yolk!!) and cut back on sugars and carbs. And don't drink soda!

Monday, October 25, 2010

Our medical system is a house of cards

I just came across this great article about meta-researcher Dr. John Ioannidis. Here's the summary:

Much of what medical researchers conclude in their studies is misleading, exaggerated, or flat-out wrong. So why are doctors—to a striking extent—still drawing upon misinformation in their everyday practice? Dr. John Ioannidis has spent his career challenging his peers by exposing their bad science.

The gist is that modern medical research is deeply flawed and biased such that the "conclusions" that you and I eventually read in the news headlines are often false. I especially love his advice for us all:

Ioannidis suggests a simple approach: ignore them all

This is in fact my approach! I have a simple rule: if it tastes good it's good for you. So I eat plenty of fat, salt, sugar, cholesterol, carbs, etc. I love eggs and cheese and I always avoid low-fat or low-cholesterol foods. I get lots of sun and never use sun screen. I drink coffee and beer, daily. I drink lots of water. I get daily exercise, running and walking. And I avoid hand sanitizers like Purell (I believe commonplace dirt/germs are in fact natural and good for you). I strongly believe humans do not need pills to stay healthy. I don't take a daily vitamin. And I'm very healthy!

This short interview between Discover Magazine and Harvard clinician John Abramson echoes the same core problem. Here's a choice quote:

When you look at the highest quality medical studies, the odds that a study will favor the use of a new drug are 5.3 times higher for commercially funded studies than for noncommercially funded studies.

Unfortunately, the medical world has a deep, deep conflict of interest: healthy people do not generate profits. Capitalism is a horrible match to health care.

So, next time your doctor prescribes a fancy new cool-sounding powerful drug like Alevia or Omosia or Nanotomopia or whatever, try to remember that our medical system is really built on a house of cards. Your doctor, let alone you, cannot possibly differentiate what's true from what's false. Don't trust that large triple-blind random controlled trial that supposedly validated this cool new drug. You are the guinea pig! And it's only when these drugs cause all sorts of problems once they are really tested on the population at large that their true colors are revealed.

Sunday, October 17, 2010

Pics from BBQ after Lucene Revolution

I finally pulled the pics off my camera from last week's BBQ after Lucene Revolution in Boston, where much fun was had! See them here. It was awesome to finally meet everyone!

Saturday, October 9, 2010

Fun with flexible indexing

The Lucene Revolution conference just wrapped up yesterday. It was well attended (~300 or so people). It was great fun to hear about all the diverse ways that Lucene and Solr are being used in the real world.

I gave a talk about flexible indexing, coming in the next major release of Lucene (4.0). Slides are here.

Tuesday, October 5, 2010

Lucene's SimpleText codec

Inspired by this question on the Lucene users list, I created a new codec in Lucene called the SimpleText codec. The best ideas come from the users list!

This is of course only available in Lucene's current trunk, eventually to be released as the next major version (4.0). Flexible indexing makes it easy to swap in different codecs to do the actual writing and reading of postings data to/from the index, and we have several fun codecs already available and more on the way...

Unlike all other codecs, which save the postings data in compact binary files, this codec writes all postings to a single human-readable text file, like this:

field contents
  term file
    doc 0
      pos 5
  term is
    doc 0
      pos 1
  term second
    doc 0
      pos 3
  term test
    doc 0
      pos 4
  term the
    doc 0
      pos 2
  term this
    doc 0
      pos 0
END


The codec is read/write, and fully functional. All of Lucene's unit tests pass (slowly) with this codec (which, by the way, is an awesome way to test your own codecs).

Note that the performance of SimpleText is quite poor, as expected! For example, there is no terms index for fast seeking to a specific term, no skipping data for fast seeking within a posting list, some operations require linear scanning, etc. So don't use this one in production!

But it should be useful for transparency, debugging, learning, teaching, or for anyone who is simply curious about what exactly Lucene stores in its inverted index.

Tuesday, September 28, 2010

Track your home's live electricity usage with Python

Modern electric meters (the ones attached to the outside of your house) emit an IR flash for each watt-hour consumed, through the port on the top of the meter. It turns out it's easy to detect this flash, decode it into "live" power usage, and make cool charts like this:

That's live kW on the Y axis and time on the X axis.

The flash, at least for my meter, appears to have high temporal accuracy, meaning it flashes precisely as you cross 1.000 Wh. This is great, since it enables accurate, real-time measurement of power usage. For example, when I turn on the lights in my home office, I quickly see the power jump by ~65 watts, and then drop again when I turn the lights off.

This is a fun way to track down which appliances or crazy computers or parasitic drains are using up so much electricity in your house!

I've gone through several designs for this over the years. My last attempt was destroyed when lightning hit my house (there's always a silver lining!), so I decided to try something new this time, and I'm very happy with the approach: it's much simpler, and it moves the pulse detection into Python.

I use a trivial analog circuit to detect the IR pulse, consisting of two 100K resistors in series and an IR photodiode in parallel with one of the resistors. Make sure you get the polarity right on the photodiode otherwise it won't work! Mount the photodiode on your power meter, aligned so that it "sees" each IR pulse. I also added a small (0.01 uF) ceramic capacitor in parallel with the same resistor, to suppress transient EM fields otherwise picked up by the relatively long wires out to my meter.

Finally, I use this simple USB audio adapter to move the problem into the digital domain, connecting the ends of the two series resistors to the mic input, which handles the A/D conversion. This USB audio adapter drives the mic input with ~4.0V bias voltage, which is great since otherwise you'd need an external source.

When there's no pulse, the photodiode acts roughly like an open switch, meaning there is a fixed 200K resistive load on the mic input. When a pulse happens (mine seem to last for ~10 msec), the photodiode acts roughly like a closed switch, suddenly dropping the series resistance to 100K. This drop causes a very detectable pulse (strongly negative then strongly positive) on the mic input's voltage; I suspect this is because there's a capacitor behind the bias voltage (but: I am no analog circuit engineer, so this is speculation!).

You can plug this USB audio adapter into any computer; I use a Sheeva plug computer (delightful device, very low power -- I have three!). I record the digital samples (arecord works well, at a 2 kHz rate) and decode the samples in Python to detect a pulse whenever the value drops below -1000. You can easily compute the live kW based on the time between two adjacent pulses, push this into a database, and build graphs on top of this using Google's visualization APIs (I use dygraphs).
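
The conversion from pulse spacing to power is simple arithmetic, since each flash marks exactly one watt-hour consumed. Here's a minimal sketch of just that conversion (written in Java purely to illustrate the math; my actual decoder is a small Python script):

// Each IR flash marks 1 watt-hour consumed, so the average power between two
// adjacent flashes is energy / time: 1 Wh over dt seconds = 3600/dt watts.
double liveKw(double pulseTimeSec, double prevPulseTimeSec) {
  double dtSeconds = pulseTimeSec - prevPulseTimeSec;
  double watts = 3600.0 / dtSeconds;
  return watts / 1000.0;   // kW, ready to push into the database for charting
}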

My last approach didn't have nearly the temporal accuracy (ie, it smoothed heavily across time), which masks interesting things. For example, now I can tell the difference between a resistive load (the coffee maker, oven, crockpot) and an inductive load (refrigerator compressor, vacuum cleaner) because the inductive load has a huge spike in the beginning, as the motor consumes lots of power trying to spin up.

Friday, September 17, 2010

Lucene's indexing is fast!


Wikipedia periodically exports all of the content on their site, providing a nice corpus for performance testing. I downloaded their most recent English XML export: it uncompresses to a healthy 21 GB of plain text! Then I fully indexed this with Lucene's current trunk (to be 4.0): it took 13 minutes and 9 seconds, or 95.8 GB/hour -- not bad!

Here are the details: I first pre-process the XML file into a flat file with one document per line (each line holds that doc's title, date, and body), and then index from this file, so that I measure "pure" indexing cost. Note that a real app would likely have a higher document creation cost here, perhaps having to pull documents from a remote database or from separate files, run filters to extract text from PDFs or MS Office docs, etc. I use Lucene's contrib/benchmark package to do the indexing; here's the alg I used:

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer

content.source = org.apache.lucene.benchmark.byTask.feeds.LineDocSource
docs.file = /lucene/enwiki-20100904-pages-articles.txt

doc.stored = true
doc.term.vector = false
doc.tokenized = false
doc.body.stored = false
doc.body.tokenized = true

log.step.AddDoc=10000

directory=FSDirectory
compound=false
ram.flush.mb = 256

work.dir=/lucene/indices/enwiki

content.source.forever = false

CreateIndex

{ "BuildIndex"
[ { "AddDocs" AddDoc > : * ] : 6
- CloseIndex
}

RepSumByPrefRound BuildIndex

There is no field truncation taking place, since this is now disabled by default -- every token in every Wikipedia article is being indexed. I tokenize the body field, and don't store it, and don't tokenize the title and date fields, but do store them. I use StandardAnalyzer, and I include the time to close the index, which means IndexWriter waits for any running background merges to complete. The index only has 4 fields -- title, date, body, and docid.

I've done a few things to speed up the indexing:
  • Increase IndexWriter's RAM buffer from the default 16 MB to 256 MB
  • Run with 6 threads
  • Disable compound file format
  • Reuse document/field instances (contrib/benchmark does this by default)
Lucene's wiki describes additional steps you can take to speed up indexing.
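
For reference, here's roughly what those settings look like if you drive IndexWriter directly instead of through contrib/benchmark (a sketch against the trunk/4.0 APIs; details like the merge policy cast may differ in your version):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

Directory dir = FSDirectory.open(new File("/lucene/indices/enwiki"));

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
    new StandardAnalyzer(Version.LUCENE_40));
iwc.setRAMBufferSizeMB(256.0);   // up from the 16 MB default

// Disable compound file format (trunk's default merge policy is a LogMergePolicy)
((LogMergePolicy) iwc.getMergePolicy()).setUseCompoundFile(false);

IndexWriter writer = new IndexWriter(dir, iwc);
// ... feed documents from 6 threads, each calling writer.addDocument(doc),
//     reusing Document/Field instances per thread ...
writer.close();   // waits for any running background merges to complete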

Both the source lines file and index are on an Intel X25-M SSD, and I'm running it on a modern machine, with dual Xeon X5680s, overclocked to 4.0 GHz, with 12 GB RAM, running Fedora Linux. Java is 64-bit 1.6.0_21-b06, and I run as java -server -Xmx2g -Xms2g. I could certainly give it more RAM, but it's not really needed. The resulting index is 6.9 GB.

Out of curiosity, I made a small change to contrib/benchmark, to print the ingest rate over time. It looks like this (over a 100-second window):



Note that a large part (slightly over half!) of the time, the ingest rate is 0; this is not good! This happens because the flushing process, which writes a new segment when the RAM buffer is full, is single-threaded, and, blocks all indexing while it's running. This is a known issue, and is actively being addressed under LUCENE-2324.

Flushing is CPU intensive -- the decode and reencode of the great many vInts is costly. Computers usually have big write caches these days, so the IO shouldn't be a bottleneck. With LUCENE-2324, each indexing thread state will flush its own segment, privately, which will let us use CPU and IO concurrency across threads. Once this is fixed, Lucene should be able to make full use of the hardware, ie fully saturate either CPU or IO, so that whichever one is the bottleneck in your context gates your ingest rate. Then maybe we can double this already fast ingest rate!

Monday, September 13, 2010

Fast search filters using flex

A filter in Lucene is a bit set that restricts the search space for any query; you pass it into IndexSearcher's search method. It's effective for a number of use cases, such as document security, index partitions, facet drill-down, etc.

To apply a filter, Lucene must compute the intersection of the documents matching the query against the documents allowed by the filter. Today, we do that in IndexSearcher like this:

while (true) {
  if (scorerDoc == filterDoc) {
    // Check if scorer has exhausted, only before collecting.
    if (scorerDoc == DocIdSetIterator.NO_MORE_DOCS) {
      break;
    }
    // Both the query and the filter match this doc: collect it, then
    // advance both to their next candidates.
    collector.collect(scorerDoc);
    filterDoc = filterIter.nextDoc();
    scorerDoc = scorer.advance(filterDoc);
  } else if (scorerDoc > filterDoc) {
    // The query is ahead: leapfrog the filter up to the query's doc.
    filterDoc = filterIter.advance(scorerDoc);
  } else {
    // The filter is ahead: leapfrog the query up to the filter's doc.
    scorerDoc = scorer.advance(filterDoc);
  }
}

We call this the "leapfrog approach": the query and the filter take turns trying to advance to each other's next matching document, often jumping past the target document. When both land on the same document, it's collected.

Unfortunately, for various reasons this implementation is inefficient (these are spelled out more in LUCENE-1536):
  1. The advance method for most queries is costly.
  2. The advance method for most filters is usually cheap.
  3. If the number of documents matching the query is far higher than
    the number matching the filter, or vice versa, it's better to drive
    the matching by whichever is more restrictive.
  4. If the filter supports fast random access, and is not super
    sparse, it's better to apply it during postings enumeration, like
    deleted docs.
  5. Query scorers don't have a random access API, only .advance(),
    which does unnecessary extra work, .next()'ing to the next matching
    document.
To fix this correctly, Lucene really needs an optimization stage, much like a modern database, which looks at the type and structure of the query and filter, as well as estimates of how many documents will match, and then picks an appropriate execution path. Likely this will be coupled with source code specialization/generation (LUCENE-1594) to write dedicated Java code to execute the chosen path. Some day we'll get to that point!

Until then, there is a simple way to get a large speedup in many cases, addressing the 4th issue above. Prior to flexible indexing, when you obtained the postings enumeration for documents matching a given term, Lucene would silently filter out deleted documents. With flexible indexing, the API now allows you to pass in a bit set marking the documents to skip. Normally you'd pass in the IndexReader's deleted docs. But, with a simple subclass of FilterIndexReader, it's possible to use any filter as the documents to skip.

To test this, I created a simple class, CachedFilterIndexReader (I'll attach it to LUCENE-1536). You pass it an existing IndexReader, plus a Filter, and it creates an IndexReader that filters out both deleted documents and documents that don't match the provided filter. Basically, it compiles the IndexReader's deleted docs (if any), and the negation of the incoming filter, into a single cached bit set, and then passes that bit set as the skipDocs whenever postings are requested. You can then create an IndexSearcher from this reader, and all searches against it will be filtered according to the filter you provided.
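
The heart of it is just folding the two bit sets into one. Given the incoming IndexReader reader and Filter filter, it does roughly something like this (a sketch only, assuming the trunk-era Filter.getDocIdSet(IndexReader), IndexReader.isDeleted and OpenBitSet APIs, not the exact code I'll attach to the issue):

// Build a single cached "skip docs" bit set: skip everything the filter
// rejects, plus any deleted documents.
OpenBitSet skipDocs = new OpenBitSet(reader.maxDoc());
skipDocs.set(0, reader.maxDoc());                       // start by skipping all docs

DocIdSetIterator allowed = filter.getDocIdSet(reader).iterator();
int doc;
while ((doc = allowed.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  skipDocs.clear(doc);                                  // un-skip docs the filter accepts
}

if (reader.hasDeletions()) {
  for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i)) {
      skipDocs.set(i);                                  // also skip deleted docs
    }
  }
}
// skipDocs is then passed as the skip set whenever postings are requested.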

This is just a prototype, and has certain limitations, eg it doesn't implement reopen, it's slow to build up its cached filter, etc.

Still, it works very well! I tested it on a 10M Wikipedia index, with a random filter accepting 50% of the documents:

Query              QPS Default   QPS Flex   % change
united~0.7               19.95      19.25      -3.5%
un*d                     43.19      52.21      20.9%
unit*                    21.53      30.52      41.8%
"united states"           6.12       8.74      42.9%
+united +states           9.68      14.23      47.0%
united states             7.71      14.56      88.9%
states                   15.73      36.05     129.2%

I'm not sure why the fuzzy query got a bit slower, but the speedups on the other queries are awesome. However, this approach is actually slower if the filter is very sparse. To test this, I ran just the TermQuery ("states") against different filter densities:

The cutover, for TermQuery at least, is somewhere around 1.1%, meaning if the filter accepts more than 1.1% of the index, it's best to use the CachedFilterIndexReader class; otherwise it's best to use Lucene's current implementation.

Thanks to this new flex API, until we can fix Lucene to properly optimize for filter and query intersection, this class gives you a viable, fully external, means of massive speedups for non-sparse filters!

Tuesday, August 24, 2010

Here comes another bubble

Hilarious. It's a spoof of Billy Joel's catchy We Didn't Start the Fire. Scary that this was posted more than two and a half years ago!