Tuesday, September 1, 2009

Spell correction

Spell correction is a challenging feature for search engines. Unfortunately, it's also crucial: mis-spelling is rampant when users run searches. In part this is because we all can't remember how to spell, and that's no wonder: the number of English words today is 5X what it was in Shakespeare's time! But it's also because we are simply in a hurry, or, lazy, and make many typos.

I rely on aspell when I'm using emacs. Modern web browsers and word processors check the spelling of all text you enter. Web-side search engines have excellent spell correction; in fact, I no longer bother to correct my typos when entering a search. I've often wondered whether such "crutches" of our modern world are in fact weakening our minds and perhaps causing our language to further evolve? For example, I wonder how Microsoft Word's often wrong (in my experience) grammar checker has crimped "modern" writing.

My Chemistry teacher in high school refused to allow us to use calculators during our tests, for fear that we would lose our ability to do math with only basic tools (paper, pencil, brain, hands). My Physics teacher did the opposite, for the reverse fear that the distraction of doing basic math would take precious time and thought away from actually thinking about how to solve the problems. Who's right?

Google clearly sets the gold standard for respelling, that any search engine is now required to live up to. If you don't match that high bar, users are automatically disappointed. And you really don't want to disappoint your users: it's nearly impossible to get them to try out your new application, and, they often don't give second chances.

For most approaches to spell correction, the more data you throw at them the better they perform. If you have lots of queries coming in, you can use that as your sole source. Google of course has tons of queries to tap into. If you are less fortunate, you can use your index/documents as your source. Both of these approaches assume most people know how to spell well! The assumption seems to hold, for now, but I have to wonder, as we all lean on this crutch and become worse at spelling with time, won't this eventually undermine Google's approach? No worries; Google will adapt. This is not unlike investing in index funds: that approach only works well if relatively few people do it.

Lucene's basic spellchecker package, under contrib, which requires you to provide a Dictionary of "known words", allows you to derive these words from your search index. It has some limitations: it can only do context-free correction (one word at once, independent of all other words in the query); it doesn't take word frequency in the index into account when deriving the index (so if a typo gets into your index, which can easily happen, you could end up suggesting that typo!); etc. But it does provide a pluggable distance measure for picking the best candidate. It's a good start.

One particularly sneaky feature to get right is spell correction in the context of entitlements; my post this morning on Lucene's user list raises this problem in a real use case (single index to search multiple user's emails). Entitlements means restricting access for certain users to certain documents. For example, you could have a large search index containing all documents from your large intranet, but because of security on the intranet, only certain users are allowed to access certain documents.

Lucene makes it easy to implement entitlements during searching, by using static (based solely on what's indexed) or dynamic (based on some "live" external source at search time) filtering.

However, properly doing spell correction in the presence of entitlements is dangerous. If you build a global lexicon based on your index, that lexicon can easily "bleed" entitlements when there are terms that only occur in documents from one entitlement class. This might be acceptable for context-free spell correction, but if your spell correction has context (can suggest whole phrases at a time) you could easily bleed a very dangerous phrase (eg, "Bob was fired") by accident.

So, you might choose to splinter your spell correction dictionary by user class, but that could result in far too little data per user class. I'm not sure how to solve it, well; it's a challenging problem.

I hope I haven't mis-spelled any words here!

No comments:

Post a Comment