Friday, September 28, 2012

Lucene's new analyzing suggester

Live suggestions as you type into a search box, sometimes called suggest or autocomplete, have been a standard, essential search feature ever since Google set a high bar after going live just over four years ago.

In Lucene we have several different suggest implementations, under the suggest module; today I'm describing the new AnalyzingSuggester (to be committed soon; it should be available in 4.1).

To use it, you provide the set of suggest targets, which is the full set of strings and weights that may be suggested. The targets can come from anywhere; typically you'd process your query logs to create the targets, giving a higher weight to those queries that appear more frequently. If you sell movies you might use all movie titles with a weight according to sales popularity.
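
As a rough illustration of deriving targets from query logs, here is a minimal sketch that counts how often each query appears and uses that count as the weight. The class name QueryLogTargets and the one-query-per-line log format are assumptions made for this example; they are not part of Lucene.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: turns raw query-log lines into
// (target, weight) pairs, where the weight is how many times the query was seen.
class QueryLogTargets {

    // Assumes one query per log line; blank lines are ignored.
    static Map<String, Long> countQueries(List<String> logLines) {
        Map<String, Long> weights = new LinkedHashMap<>();
        for (String line : logLines) {
            String query = line.trim();
            if (query.isEmpty()) {
                continue; // skip blank lines
            }
            weights.merge(query, 1L, Long::sum); // add 1 to the running count
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("wifi router", "ghost", "wifi router", "  ");
        // Prints each distinct query with its frequency, tab-separated.
        countQueries(log).forEach((query, weight) -> System.out.println(query + "\t" + weight));
    }
}
```

The resulting map is only the raw material; how you feed the strings and weights to the suggester depends on the Lucene version you use.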

You also provide an analyzer, which is used to process each target into analyzed form. Under the hood, the analyzed form is indexed into an FST. At lookup time, the incoming query is processed by the same analyzer and the FST is searched for all completions sharing the analyzed form as a prefix.

Even though the matching is performed on the analyzed form, what's suggested is the original target (i.e., the unanalyzed input). Because Lucene has such a rich set of analyzer components, this can be used to create some useful suggesters:
  • With an analyzer that folds or normalizes case, accents, etc. (e.g., using ICUFoldingFilter), the suggestions will match irrespective of case and accents. For example, the query "ame..." would suggest Amélie.

  • With an analyzer that removes stopwords and normalizes case, the query "ghost..." would suggest "The Ghost of Christmas Past".

  • Even graph TokenStreams, such as SynonymFilter, will work: in such cases we enumerate and index all analyzed paths into the FST. If the analyzer recognizes "wifi" and "wireless network" as synonyms, and you have the suggest target "wifi router" then the user query "wire..." would suggest "wifi router".

  • Japanese suggesters may now be possible, with an analyzer that copies the reading (ReadingAttribute in the Kuromoji analyzer) as its output.
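
To make the "match on the analyzed form, suggest the original target" idea concrete, here is a simplified stand-in, not the actual AnalyzingSuggester code or API. It assumes a toy analyze() function (lowercasing, accent stripping, and a small made-up stopword list) and uses a sorted TreeMap in place of the FST. Graph token streams such as synonyms are not modeled.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified stand-in for the idea behind AnalyzingSuggester; NOT Lucene's API.
class AnalyzedPrefixSuggester {

    // Hypothetical stopword list, chosen only for this sketch.
    private static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "of"));

    // One suggest target: the original (surface) text plus its weight.
    static final class Target {
        final String surface;
        final long weight;
        Target(String surface, long weight) {
            this.surface = surface;
            this.weight = weight;
        }
    }

    // Sorted map from analyzed form to its targets (stands in for the FST).
    private final TreeMap<String, List<Target>> index = new TreeMap<>();

    // Toy "analyzer": strip accents, lowercase, drop stopwords, join with spaces.
    static String analyze(String text) {
        String folded = Normalizer.normalize(text, Normalizer.Form.NFD)
            .replaceAll("\\p{M}", "")   // remove combining marks (accents)
            .toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder();
        for (String token : folded.trim().split("\\s+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    // Index the target under its analyzed form, keeping the surface form alongside.
    void add(String surface, long weight) {
        index.computeIfAbsent(analyze(surface), k -> new ArrayList<>())
             .add(new Target(surface, weight));
    }

    // Returns up to num surface forms whose analyzed form starts with the
    // analyzed query, highest weight first.
    List<String> lookup(String query, int num) {
        String prefix = analyze(query);
        List<String> out = new ArrayList<>();
        if (prefix.isEmpty()) {
            return out; // sketch-only choice: a stopword-only query suggests nothing
        }
        List<Target> hits = new ArrayList<>();
        // Keys sharing the prefix are contiguous in sorted order, starting at tailMap(prefix).
        for (Map.Entry<String, List<Target>> e : index.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            hits.addAll(e.getValue());
        }
        hits.sort((a, b) -> Long.compare(b.weight, a.weight));
        for (Target t : hits) {
            if (out.size() == num) {
                break;
            }
            out.add(t.surface);
        }
        return out;
    }

    public static void main(String[] args) {
        AnalyzedPrefixSuggester s = new AnalyzedPrefixSuggester();
        s.add("Am\u00e9lie", 10); // "Amélie", escaped to avoid source-encoding issues
        s.add("The Ghost of Christmas Past", 5);
        System.out.println(s.lookup("ame", 5));
        System.out.println(s.lookup("ghost", 5));
    }
}
```

Because the surface form is stored next to the analyzed key, the lookup can return the original text even though matching only ever sees the analyzed form.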

Given the diversity of analyzers, and the easy extensibility for applications to create their own analyzers, I'm sure there are many interesting use cases for this new AnalyzingSuggester: if you have an example, please share it with us on Lucene's user list (java-user@lucene.apache.org).

While this is a great step forward, there's still plenty to do with Lucene's suggesters. We need to allow for fuzzy matching on the query so we're more robust to typos (there's a rough prototype patch on LUCENE-3846). We need to predict based on only part of the query, instead of insisting on a full prefix match. There are a number of interesting elements to Google's autosuggest that we could draw inspiration from. As always, patches welcome!

22 comments:

  1. Thank you for this article! I've just started using Lucene's suggesters to implement an autocomplete feature in my project. So far everything works great.

    I'd be thankful, though, if you could tell me whether AnalyzingSuggester and FuzzySuggester are thread-safe when using the lookup() method. I couldn't find this information anywhere.

    Kind regards

  2. Can you please point me to an example?

  3. Hi Aditya,

    I don't know of any examples ... but maybe look at its unit test? https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_1/lucene/suggest/src/test/org/apache/lucene/search/suggest/analyzing/AnalyzingSuggesterTest.java

  4. Hello Mike,
    I've been playing around with this for the last few days and I think I almost got it working - maybe you know what needs to be changed :)

    I have a copyField "asug", which copies from "name" (this has accents in it). It is a custom fieldType with a KeywordTokenizer (since I want the whole term to be returned), lowercase and ascii-folding, for both index and query.

    It seems that it always returns the indexed value, rather than the actual field value - so "ame.." gives me back "amelie".

    I randomly got the right value back by feeding without index-analyzers and restarting with index-analyzers. Any ideas?

    1. Now that I think about it, it makes sense why that didn't work ... I was querying on a non-stored, indexed (copy)field. So that's obviously the reason I got back the indexed value.

  5. Hi Sebastian,

    That's very odd ... it should always return the original field value, not the analyzed form. Can you make a small set of names showing the issue?

    1. Hey Mike,

      here's a pastebin: http://pastebin.com/20vSGJ1a

      After that I feed the document and do:
      http://localhost:8080/solr/wiki/autosuggest?q=asug:test&spellcheck.build=true

      I get the "right outcome" for every possible query I tried, e.g. Têst, tést, TÈST,... The only problem is that this seems to return the indexed value ("test name"), rather than the stored field value ("Têst Námè").

      Thanks!

  6. Hi Sebastian,

    Can you send an email to solr-user@lucene.apache.org with these details? I'm not sure what's going on. That test case sure looks like it should work (ie return Têst Námè not test name).

  7. So to clarify, AnalyzingSuggester needs to be used on the field directly (no copyField or anything like that); then it works :)
    So to query on the field "name" with filters specified in "text_asug" (lowercase, ascii,...) one would use:

    http://pastebin.com/tN9yXHB0

  8. Phew, thanks for bringing closure Sebastian!

  9. http://luceneautosuggester-lucene.rhcloud.com/ is a sample Lucene auto-suggester demo using the analyzing suggester and the fuzzy suggester. Lucene's autosuggester is awesome.

    1. Hello, Puneet

      Thanks for the great demo.
      I am confused about Japanese autocomplete, and I see that your demo supports it.
      Could you please share the configurations?
      Thanks a lot.

      And, Mike, thanks for this article.

    2. Thanks for the very good sample :)

  10. Puneet,

    Nice demo! Thanks for sharing.

  11. Any idea if the source is available for this demo?

  12. I'm also curious about the source code. I'm having trouble figuring out how to load an existing index for the Suggester to use (if that's even possible).

    1. If you have suggestions in your index, e.g. text and weight as stored fields in your documents, you can use the DocumentDictionary class to enumerate the suggestions from your documents. You pass that to AnalyzingSuggester.build to build the suggester.

  13. How can the AnalyzingSuggester return the original field value if the lookup is done by the analyzed value? Depending on the analyzer, the mapping from analyzed value back to original value is not unique.

    1. The surface form is separately stored (as an FST output), even though matching is done based on the analyzed form.

  14. Can you comment on thread safety? Can we just instantiate a Suggester and then build, lookup, etc. against a stored instance?

    1. Hi mschipperheyn,

      Could you ask this on Lucene's user list (java-user@lucene.apache.org)?
