Thursday, March 31, 2011

A login-wall is nearly as bad as a pay-wall!

Much has been said and asked about the differences between Stack Overflow and Quora.

And, while there are deep and interesting differences, such as how Stack Overflow makes reputation tracking and badges explicit, in my opinion, one simple difference is the most important of all: Quora's login-wall.

See, you cannot do anything with Quora until you've registered, while with Stack Overflow you can do almost everything without registering. They are polar opposites!

Like everyone else, I have too much curiosity and too little time. I try to keep up on Hacker News (sorry Digg and Reddit): I click through to the cool stuff, and then move on. You have one precious first page impression to rope me in, so don't spend that impression with a login-wall!

I mean, sure, I'm still going to go link up my Facebook account so I can login to Quora and see the questions, answers, conversations. (And, yes, Facebook seems to be winning at the "universal ID" game, even though I like OpenID better.) Still, for each persistent user like me, you've lost 9 non-persistent ones with that dreaded login-wall.

Remember: if you are a new cool Web site, gaining value from the network effect (as all social sites do), trying to eke out just a tiny slice of all these fickle users jumping around out here, don't put up a login-wall! It's just about as bad as a pay-wall. Let brand new users do as much as possible with your site, and make that very first page impression count.

Saturday, March 26, 2011

Your test cases should sometimes fail!

I'm an avid subscriber to the delightful weekly (sometimes) Python-URL! email, highlighting the past week's interesting discussions across the numerous Python lists. Each summary starts with the best quote from the week; here's last week's quote:
"So far as I know, that actually just means that the test suite is insufficient." - Peter Seebach, when an application passes all its tests.
I wholeheartedly agree: if your build always passes its tests, that means your tests are not tough enough! Ideally the tests should stay ahead of the software, constantly pulling you forwards to improve its quality. If the tests keep passing, write new ones that fail! Or make existing ones evil-er.

You'll be glad to know that Lucene/Solr's tests do sometimes fail, as you can see in the Jenkins (formerly Hudson) automated trunk builds.


Randomized testing

Our test infrastructure has gotten much better, just over the past 6 months or so, through heavy use of randomization.

When a test needs a Directory instance, but doesn't care which, it uses the newDirectory method. This method picks one of Lucene's Directory implementations (RAMDirectory, NIOFSDirectory, MMapDirectory, etc.) and then wraps it with MockDirectoryWrapper, a nice little class that does all sorts of fun things like: occasionally calling Thread.yield; preventing still-open files from being overwritten or deleted (acts-like-Windows); refusing to write to the same file twice (verifying Lucene is in fact write-once); breaking up a single writeBytes into multiple calls; optionally throwing IOException on disk full, or simply throwing exceptions at random times; simulating an OS/hardware crash by randomly corrupting un-sync'd files in devilish ways; etc. We also pick a random timezone and locale.
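
To get a feel for it, here's a tiny sketch (not from the test sources themselves) of what a randomized test might look like, assuming the LuceneTestCase helpers described above; the exact test plumbing is simplified:

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.LuceneTestCase;

    public class TestSomething extends LuceneTestCase {
      public void testBasic() throws Exception {
        // newDirectory() randomly picks a Directory implementation and
        // wraps it in MockDirectoryWrapper
        Directory dir = newDirectory();
        // ... index and search against dir here ...
        // MockDirectoryWrapper complains on close() if files were left open
        dir.close();
      }
    }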

To randomize indexing, we create an IndexWriterConfig, tweaking all sorts of settings, and use RandomIndexWriter (like IndexWriter, except it sometimes optimizes, commits, yields, etc.). The newField method randomly enables or disables stored fields and term vectors. We create random codecs, per field, by combining a terms dictionary with a random terms index and postings implementations. MockAnalyzer injects payloads into its tokens.
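
For example, inside a LuceneTestCase subclass, randomized indexing might look roughly like this (a hedged sketch; the random instance and the newField signature come from LuceneTestCase and may differ across versions):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.RandomIndexWriter;
    import org.apache.lucene.store.Directory;

    public void testRandomIndexing() throws Exception {
      Directory dir = newDirectory();
      // RandomIndexWriter randomizes IndexWriterConfig settings and
      // occasionally commits/optimizes on its own
      RandomIndexWriter w = new RandomIndexWriter(random, dir);
      Document doc = new Document();
      // newField randomly turns stored fields and term vectors on or off
      doc.add(newField("body", "some test content", Field.Store.NO, Field.Index.ANALYZED));
      w.addDocument(doc);
      IndexReader r = w.getReader();  // reader over the randomly built index
      // ... run queries against r ...
      r.close();
      w.close();
      dir.close();
    }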

Sometimes we use the PreFlex codec, which writes all indices in the 3.x format (so that we test index backwards compatibility), and sometimes the nifty SimpleText codec. We have exotic methods for creating random yet somewhat realistic full Unicode strings. When creating an IndexSearcher, we might use threads (pass an ExecutorService), or not. We catch tests that leave threads running, or that cause insanity in the FieldCache (for example by loading both parent and sub readers).
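
As one small illustration, the threaded-or-not searcher choice might be made like this (a sketch, assuming IndexSearcher's ExecutorService constructor is available in your version):

    import java.util.Random;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    IndexSearcher newRandomSearcher(Random random, IndexReader reader) {
      if (random.nextBoolean()) {
        // threaded, per-segment searching
        ExecutorService es = Executors.newFixedThreadPool(2);
        return new IndexSearcher(reader, es);  // shut es down when the test is done
      } else {
        return new IndexSearcher(reader);
      }
    }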


Reproducibility

To ensure a failure is reproducible, we save the random seeds and on a failure print out a nice line like this:
NOTE: reproduce with: ant test -Dtestcase=TestFieldCacheTermsFilter -Dtestmethod=testMissingTerms -Dtests.seed=-1046382732738729184:5855929314778232889
This fixes the seed so that the test runs deterministically. Sometimes, horribly, we have bugs in this seed logic, causing tests not to run deterministically, and we scramble to fix those bugs first!

If you happen to hit a test failure, please send that precious line to the dev list! This is like the Search for Extraterrestrial Intelligence (SETI): there are some number of random seeds out there (hopefully, not too many!), that will lead to a failure, and if your computer is lucky enough to discover one of these golden seeds, please share the discovery!

The merging of Lucene and Solr's development was also a big step forward for test coverage, since every change in Lucene is now tested against all of Solr's test cases as well.

Tests accept a multiplier to crank things up, causing them to use more test documents or iterations, run for a longer time, etc. We now have perpetual jobs on Jenkins, for both 3.x and trunk, launching every 15 minutes with multiplier 5. We know quickly when someone breaks the build!
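
For example, assuming the tests.multiplier system property used by the Lucene build (the exact property name may vary by version), such a cranked-up run looks something like:
ant test -Dtests.multiplier=5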

This added test coverage has already caught a number of sneaky bugs (including a rare index corruption case on disk-full and a chunking bug in MMapDirectory) that we otherwise would not have discovered for some time.

The test infrastructure itself is so useful that it's now been factored out as a standalone JAR so apps using Lucene can tap into it to create their own fun randomized tests.

Thursday, March 24, 2011

Lucene's FuzzyQuery is 100 times faster in 4.0

There are many exciting improvements in Lucene's eventual 4.0 (trunk) release, but the awesome speedup to FuzzyQuery really stands out, not only because of its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be.

FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.

The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).

FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.
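
In code, a fuzzy search looks something like this hedged sketch (the field and term are made up; in 4.0 the constructor takes the max edit count directly, while older releases took a similarity float instead):

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;

    TopDocs searchFuzzy(IndexSearcher searcher) throws IOException {
      // match terms within edit distance 1 of "mcandless" in the "author" field
      FuzzyQuery q = new FuzzyQuery(new Term("author", "mcandless"), 1);
      return searcher.search(q, 10);
    }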

Prior to 4.0, FuzzyQuery took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.
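
Conceptually, that old approach boils down to something like this self-contained sketch (not Lucene's actual code):

    import java.util.ArrayList;
    import java.util.List;

    class BruteForceFuzzy {
      // classic dynamic-programming Levenshtein edit distance
      static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
          for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                               d[i - 1][j - 1] + cost);
          }
        }
        return d[a.length()][b.length()];
      }

      // visit EVERY unique term in the index and keep only the close ones
      static List<String> fuzzyMatches(Iterable<String> allTerms, String base, int maxEdits) {
        List<String> matches = new ArrayList<String>();
        for (String term : allTerms) {
          if (editDistance(term, base) <= maxEdits) {
            matches.add(term);
          }
        }
        return matches;
      }
    }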


The journey begins

The long journey began when Robert Muir had the idea of pre-building a Levenshtein Automaton, a deterministic finite automaton (DFA) that accepts only the terms within edit distance N. Doing this, up front, and then intersecting that automaton with the terms in the index, should give a massive speedup, he reasoned.

At first he built a simple prototype, explicitly unioning the separate DFAs that allow for up to N insertions, deletions and substitutions. But, unfortunately, just building that DFA (let alone then intersecting it with the terms in the index), was too slow.

Fortunately, after some Googling, he discovered a paper by Klaus Schulz and Stoyan Mihov (now famous among the Lucene/Solr committers!) detailing an efficient algorithm for building the Levenshtein Automaton from a given base term and max edit distance. All he had to do was code it up! It's just software after all. Somehow, he roped Mark Miller, another Lucene/Solr committer, into helping him do this.

Unfortunately, the paper was nearly unintelligible! It's 67 pages, filled with all sorts of equations, Greek symbols, definitions, propositions, lemmas, proofs. It uses scary concepts like Subsumption Triangles, along with beautiful yet still unintelligible diagrams. Really the paper may as well have been written in Latin.

Much coffee and beer was consumed, sometimes simultaneously. Many hours were spent on IRC, staying up all night, with Mark and Robert carrying on long conversations, which none of the rest of us could understand, trying desperately to decode the paper and turn it into Java code. Weeks went by like this, and they actually made some good initial progress, managing to loosely crack the paper to the point where they had a test implementation of the N=1 case, and it seemed to work. But generalizing that to the N=2 case was... daunting.


The breakthrough

Then, finally, a breakthrough! Robert found, after even more Googling, an existence proof, in an unexpected place: an open-source package, Moman, under the generous MIT license. The author, Jean-Phillipe Barrette-LaPierre, had somehow, incredibly, magically, quietly, implemented the algorithm from this paper. And this was apparently a random side project for him, unrelated to his day job. So now we knew it was possible (and we all have deep admiration for Jean-Phillipe!).

We decided to simply re-use Moman's implementation to accomplish our goals. But, it turns out, its source code is all Python (my favorite programming language)! And, nearly as hairy as the paper itself. Nevertheless, we pushed on.

Not really understanding the Python code, nor the paper, we desperately tried to write our own Python code to tap into the various functions embedded in Moman's code, to auto-generate Java code containing the necessary tables for each max edit distance case (N=1, N=2, etc.). We had to guess what each Python function did, by its name, trying to roughly match this up to the spooky terminology in the paper.

The result was createLevAutomata.py: it auto-generates crazy looking Java code (see Lev2ParametricDescription.java, and scroll to the cryptic packed tables at the bottom), which in turn is used by further Java code to create the Levenshtein automaton per-query. We only generate the N=1 and N=2 cases (the N>=3 cases aren't really practical, at least not yet).


The last bug...

Realize, now, what a crazy position we were in. We wrote our own scary Python code, tapping into various functions in the Moman package, to auto-generate unreadable Java code with big tables of numbers, which is then used to generate Levenshtein automata from the base term and N. We went through many iterations with this crazy chain of Python and Java code that we barely understood, slowly iterating to get the bugs out.

After fixing many problems, we still had one persistent bug which we just couldn't understand, let alone fix. We struggled for several days, assuming the bug was in our crazy Python/Java chain. Finally, we considered the possibility that the bug was in Moman, and indeed Robert managed to reduce the problem to a tiny Python-only case showing where Moman failed to match the right terms. Robert sent this example to Jean-Phillipe, who quickly confirmed the bug and posted a patch the next day. We applied his patch and suddenly everything was working perfectly!

Fortunately, while this fast FuzzyQuery was unbelievably hairy to implement, testing it well is relatively easy since we can validate it against the brute-force enumeration from 3.0. We have several tests verifying the different layers executed by the full FuzzyQuery. The tests are exhaustive in that they test all structurally different cases possible in the Levenshtein construction, using binary (only characters 0 and 1) terms.

Beyond just solving this nearly impossible task of efficiently compiling a term to a Levenshtein Automaton, we had many other parts to fill in. For example, Robert separately created a general AutomatonQuery, re-using infrastructure from the open-source Brics automaton package, to enable fast intersection of an automaton against all terms and documents in the index. This query is now used to handle WildcardQuery, RegexpQuery, and FuzzyQuery. It's useful for custom cases, too; for example, it's used by Solr to reverse wildcard queries. These slides from Robert describe AutomatonQuery, and its fun possible use cases, in more detail.
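
For instance, here's a hedged sketch of building an AutomatonQuery by hand from a regular expression (package and class names are from 4.0 trunk and may differ elsewhere; the field and regexp are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.AutomatonQuery;
    import org.apache.lucene.util.automaton.Automaton;
    import org.apache.lucene.util.automaton.RegExp;

    AutomatonQuery regexpAsAutomatonQuery() {
      // build an automaton accepting terms that match the regexp, then wrap
      // it in a query that intersects it with the index's terms
      Automaton a = new RegExp("mc+andles+").toAutomaton();
      return new AutomatonQuery(new Term("author", "mc+andles+"), a);
    }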

Separately, we had an impedance mismatch: these automatons speak full Unicode (UTF32) characters, yet Lucene's terms are stored in UTF8 bytes, so we had to create a UTF32 -> UTF8 automaton converter, which by itself was also very hairy! That converter translates any UTF32 automaton into an equivalent UTF8 automaton, which can be directly intersected against the terms in the index.

So, today, when you run a FuzzyQuery in 4.0, it efficiently seeks and scans only those regions of the term space which may have matches, guided by the Levenshtein automaton. This, coupled with ongoing performance improvements to seeking and scanning terms, as well as other major improvements like performing MultiTermQuery rewrites per-segment, has given us the astounding overall gains in FuzzyQuery.

Thanks to these enormous performance improvements, Robert has created an entirely new automaton spell checker that uses this same algorithm to find candidate terms for respelling. This is just like FuzzyQuery, except it doesn't visit the matching documents. This is a big improvement over the existing spellchecker as it does not require a separate spellchecker index be maintained.
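
A hedged sketch of what using it looks like (I'm assuming the DirectSpellChecker class this work became; names and signatures may differ in your version):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spell.DirectSpellChecker;
    import org.apache.lucene.search.spell.SuggestWord;

    SuggestWord[] suggest(IndexReader reader) throws IOException {
      // suggestions come straight from the main index's terms; there is no
      // separate spell-checker index to build or keep in sync
      DirectSpellChecker spell = new DirectSpellChecker();
      return spell.suggestSimilar(new Term("body", "lucnee"), 5, reader);
    }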

This whole exciting experience is a great example of why open-source development works so well. Here we have diverse committers from Lucene/Solr, bringing together their various unusual strengths (automatons, Unicode, Python, etc.) to bear on an insanely hard challenge, leveraging other potent open-source packages including Moman and Brics, iterating with the authors of these packages to resolve bugs. No single person involved in this really understands all of the parts; it's truly a team effort.

And now you know what's going on under the hood when you see incredible speedups with FuzzyQuery in 4.0!

[For the not-faint-of-heart, you can browse LUCENE-1606 to see parts of this story unfolding through Jira]

Friday, March 11, 2011

Hack on Lucene this summer!

Are you a student? Looking to do some fun coding this summer? Then join us for the 2011 Google Summer of Code!

The application deadline is in less than a month! Lucene has these initial potential projects identified, but you can also pick your own; just be sure to discuss with the community first (send an email to dev@lucene.apache.org).