Thursday, January 23, 2014

Finding long tail suggestions using Lucene's new FreeTextSuggester

Lucene's suggest module offers a number of fun auto-suggest implementations to give a user live search suggestions as they type each character into a search box.

For example, WFSTCompletionLookup compiles all suggestions and their weights into a compact Finite State Transducer, enabling fast prefix lookup for basic suggestions.

AnalyzingSuggester improves on this by using an Analyzer to normalize both the suggestions and the user's query so that trivial differences in whitespace, casing, stop-words, synonyms, as determined by the analyzer, do not prevent a suggestion from matching.

Finally, AnalyzingInfixSuggester goes further by allowing infix matches so that words inside each suggestion (not just the prefix) can trigger a match. You can see this one action at the Lucene/Solr Jira search application (e.g., try "python") that I recently created to eat our own dog food. It is also the only suggester implementation so far that supports highlighting (this has proven challenging for the other suggesters).

Yet, a common limitation to all of these suggesters is that they can only suggest from a finite set of previously built suggestions. This may not be a problem if your suggestions are past user queries and you have tons and tons of them (e.g., you are Google). Alternatively, if your universe of suggestions is inherently closed, such as the movie and show titles that Netflix's search will suggest, or all product names on an e-commerce site, then a closed set of suggestions is appropriate.

N-Gram language models

For everyone else, where a high percentage of the incoming queries fall into the never-seen-before long tail, Lucene's newest suggester, FreeTextSuggester, can help! It uses the approach described in this Google blog post.

Rather than precisely matching a previous suggestion, it builds up a simple statistical n-gram language model from all suggestions and looks at the last tokens (plus the prefix of whatever final token the user is typing, if present) to predict the most likely next token.

For example, perhaps the user's query so far is: "flashforge 3d p", and because flashforge is an uncommon brand of 3D printer, this particular suggestion prefix was never added to the suggester. Yet, "3d printer" was a frequently seen phrase in other contexts (different brands). In this case, FreeTextSuggester will see "3d" and the "p" prefix for the next token and predict printer, even though "flashforge 3d printer" was never explicitly added as a suggestion.

You specify the order (N) of the model when you create the suggester: larger values of N require more data to train properly but can make more accurate predictions. All lower order models are also built, so if you specify N=3, you will get trigrams, bigrams and unigrams, all compiled into a single weighted FST for maximum sharing of the text tokens. Of course, larger N will create much larger FSTs. In practice N=3 is the highest you should go, unless you have tons of both suggestions to train and RAM to hold the resulting FST.

To handle sparse data, where a given context (the N-1 prior words) was not seen frequently enough to make accurate predictions, the suggester uses the stupid backoff language model (yes, this is really its name, and yes, it performs well!).

I expect the best way to use this new FreeTextSuggester will be as a fallback: you would first use one of the existing exact match suggesters, but when those suggesters fail to find any suggestions for a given query, because it's "unusual" and has crossed over into the long tail, you then fall back to FreeTextSuggester.

Google seems to use such a modal approach to suggestions as well: if you type "flashforge 3d p" you should see something like this, where each suggestion covers your entire query so far (indeed, Google has heard of the flashforge brand of 3d printer!):



But then if you keep typing and enter "flashforge 3d printer power u", the suggestions change: instead of suggesting an entire query, matching everything I have typed, Google instead suggests the last word or two:



As usual, this feature is very new and likely to contain exciting bugs! See the Jira issue, LUCENE-5214, for details. If you play with this new suggester please start a discussion on the Lucene's user list!

6 comments:

  1. I plan to use AnalyzingInfixSuggester based on docs' title field, and fallback to FreeTextSuggester based on docs' catch all field.
    FreeTextSuggester also sounds great if you want to build a prediction keyboard on mobile to speed up typing. Feed it with enough past email/messages, and vuala it knows your writing style.

    ReplyDelete
  2. Oh, and the post is great! Thx Mike :)

    ReplyDelete
    Replies
    1. Thanks Gili, I'm glad you're planning on using the new suggesters. Send us feedback please!

      Delete
  3. Ravi Kiran BhaskarApril 15, 2014 at 11:05 AM

    Thank you for the wonderful articles on suggesters Mr. McCandless. Can the FreeTextSuggester be used on its own and not as a fallback? I cound not find good documentation or examples on FreeTextSuggester. However I have used the AnalyzingInfixSuggester from 4.7.0 and works fine but for just one weird error (shown below). If I just want to use the indexed field and not use a static dictionary do we still need to configure the directory ?? My config is as shown below













    suggest
    org.apache.solr.spelling.suggest.Suggester
    org.apache.solr.spelling.suggest.fst.AnalyzingInfixLookupFactory
    spelltext
    spellsuggest
    true
    true




    INFO - 2014-04-14 22:39:22.390; org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener; Loading spell index for spellchecker: default
    [#|2014-04-14T22:39:22.390+0000|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=257;_ThreadName=Thread-2;|6021 [searcherExecutor-25-thread-1] ERROR org.apache.solr.core.SolrCore – Exception in reloading spell check index for spellchecker: suggest
    org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/local/domains/xxxx/config/analyzingInfixSuggesterIndexDir.tmp/write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:89)
    at org.apache.lucene.index.IndexWriter.(IndexWriter.java:702)
    at org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.build(AnalyzingInfixSuggester.java:222)
    at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:165)
    at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:141)
    at org.apache.solr.spelling.suggest.Suggester.reload(Suggester.java:173)
    at org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener.newSearcher(SpellCheckComponent.java:715)
    at org.apache.solr.core.SolrCore$5.call(SolrCore.java:1695)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

    Thanks

    Ravi Kiran Bhaskar

    ReplyDelete
    Replies
    1. Hmm, that exception means an AnalyzingInfixSuggester is still open on the directory when you (Solr) tried to open another one.

      You could try using FreeTextSuggester as the only/primary suggester and see how it works ... post feedback back to the list with your experience!

      Delete
    2. I apologize for not getting back to you earlier. I totally forgot about posting my findings on FreeTextSuggester. Both suggesters hang when there is lot of data in the field that you configure the suggester with. We index news articles and if i point the suggesters to the "body" field (1.5 million docs in a single core/collection) the startup just hangs and runs out of memory.

      Another issue i noticed (maybe because I used a "text" field ???) the suggestions are not that great...sometimes when the articles are new, with not much data similar surrounding articles on same context, the suggesters suggest random words which dont relate to the subject at all.

      Hence we had to ditch the suggesters and go with the shingles approach which was giving much more salient long tail suggestions

      Thanks,

      Ravi Kiran Bhaskar

      Delete