Saturday, September 28, 2013

Lucene now has an in-memory terms dictionary, thanks to Google Summer of Code

Last year, Han Jiang's Google Summer of Code project was a big success: he created a new (now, default) postings format for substantially faster searches, along with smaller indices.

This summer, Han was at it again, with a new Google Summer of Code project with Lucene: he created a new terms dictionary holding all terms and their metadata in memory as an FST.

In fact, he created two new terms dictionary implementations. The first, FSTTermsWriter/Reader, hold all terms and metadata in a single in-memory FST, while the second, FSTOrdTermsWriter/Reader, does the same but also supports retrieving the ordinal for a term (TermsEnum.ord()) and looking up a term given its ordinal (TermsEnum.seekExact(long ord)). The second one also uses this ord internally so that the FST is more compact, while all metadata is stored outside of the FST, referenced by ord.

Like the default BlockTree terms dictionary, these new terms dictionaries accept any PostingsBaseFormat so you can separately plug in whichever format you want to encode/decode the postings.

Han also improved the PostingsBaseFormat API so that there is now a cleaner separation of how terms and their metadata are encoded vs. how postings are encoded; PostingsWriterBase.encodeTerm and PostingsReaderBase.decodeTerm now handle encoding and decoding any term metadata required by the postings format, abstracting away how the long[]/byte[] were persisted by the terms dictionary. Previously this line was annoyingly blurry.

Unfortunately, while the performance for primary key lookups is substantially faster, other queries e.g. WildcardQuery are slower; see LUCENE-3069 for details. Fortunately, using PerFieldPostingsFormat, you are free to pick and choose which fields (e.g. your "id" field) should use the new terms dictionary.

For now this feature is trunk-only (eventually Lucene 5.0).

Thank you Han and thank you Google!

Monday, September 16, 2013

Three exciting Lucene features in one day

Three exciting Lucene features in one day

Yesterday was a productive day: suddenly, there are three exciting new features coming to Lucene.

Expressions module

The first feature, committed yesterday, is the new expressions module. This allows you to define a dynamic field for sorting, using an arbitrary String expression. There is builtin support for parsing JavaScript, but the parser is pluggable if you want to create your own syntax.

For example, you could define a sort field using the expression
  sqrt(_score) + ln(popularity)
if you want to offer a blended sort primarily by relevance and boosting by a popularity field.

The code is very easy to use; there are some nice examples in the TestDemoExpressions.java unit test case, and this will be available in Lucene's next stable release (4.6).

Updateable numeric doc-values fields

The second feature, also committed yesterday, is updateable numeric doc-values fields, letting you change previously indexed numeric values using the new updateNumericDocValue method on IndexWriter. It works fine with near-real-time readers, so you can update the numeric values for a few documents and then re-open a new near-real-time reader to see the changes.

The feature is currently trunk only as we work out a few remaining issues involving an particularly controversial boolean. It also currently does not work on sparse fields, i.e. you can only update a document's value if that document had already indexed that field in the first place.

Combined, these two features enable powerful use-cases where you want to sort by a blended field that is changing over time. For example, perhaps you measure how often your users click through each document in the search results, and then use that to update the popularity field, which is then used for a blended sort. This way the rankings of the search results change over time as you learn from users which documents are popular and which are not.

Of course such a feature was always possible before, using custom external code, but with both expressions and updateable doc-values now available it becomes trivial to implement!

Free text suggestions

Finally, the third feature is a new suggester implementation, FreeTextSuggester. It is a very different suggester than the existing ones: rather than suggest from a finite universe of pre-built suggestions, it uses a simple ngram language model to predict the "long tail" of possible suggestions based on the 1 or 2 previous tokens.

Under the hood, it uses ShingleFilter to create the ngrams, and an FST to store and lookup the resulting ngram models. While multiple ngram models are stored compactly in a single FST, the FST can still get quite large; the 3-gram, 2-gram and 1-gram model built on the AOL query logs is 19.4 MB (the queries themselves are 25.4 MB). This was inspired by Google's approach.

Likely this suggester would not be used by itself, but rather as a fallback when your primary suggester failed to find any suggestions; you can see this behavior with Google. Try searching for "the fast and the ", and you will see the suggestions are still full queries. But if the next word you type is "burning" then suddenly google (so far!) does not have a full suggestion and falls back to their free text approach.