Tuesday, March 14, 2017

Apache Lucene 7.0 Is Coming Soon!

The Apache Lucene project will likely release its next major release, 7.0, in a few months!

Remember that Lucene developers generally try hard to backport new features for the next non-major (feature) release, and the upcoming 6.5 already has many great changes, so a new major release is exciting because it means the 7.0-only features, which I now describe, are the particularly big ones that we felt could not be backported for 6.5.

Of course, with every major release, we also do more mundane things like remove deprecated 6.x APIs, and drop support for old indices (written with Lucene 5.x or earlier).

This is only a subset of the new 7.0 only features; for the full list please see the 7.0.0 section in the upcoming CHANGES.txt.

Doc values as iterators

The biggest change in 7.0 is changing doc values from a random access API to a more restrictive iterator API.

Doc values are Lucene's column-stride numeric, sorted or binary per-document field storage across all documents. They can be used to hold scoring signals, such as the single-byte (by default) document length encoding or application-dependent signals, or for sorting, faceting or grouping, or even numeric fields that you might use for range filtering in some queries. Their column-stride storage means it's efficient to visit all values for the one field across documents, in contrast to row-stride storage that stored fields use to retrieve all field values for a single document.

Postings have long been consumed through an iterator, so this was a relatively natural change to make, and the two share the same base class, DocIdSetIterator, to step through or seek to each hit.

The initial rote switch to an iterator API was really just a plumbing swap and less interesting than all the subsequent user-impacting improvements that became possible thanks to the more restrictive API:
With these changes, you finally only pay for what you actually use with doc values, in index size, indexing performance, etc. This is the same as other parts of the index like postings, stored fields, term vectors, etc., and it means users with very sparse doc values no longer see merges taking unreasonably long time or the index becoming unexpectedly huge while merging.

Our nightly sparse benchmarks, based on the NYC Trip Data corpus, show the impressive gains each of the above changes (and more!) accomplished.

Goodbye index-time boosts

Index-time boosting, which lets you increase the a-priori score for a particular document versus other documents, is now deprecated and will be removed in 7.0.

This has always been a fragile feature: it was encoded, along with the field's length, into a single byte value, and thus had very low precision. Furthermore, it is now straightforward to write your custom boost into your own doc values field and use function queries to apply the boost at search time. Finally, with index time boosts gone, length encoding is more accurate, and in particular the first nine length values (1 to 9) are distinct.

Query scoring is simpler

BooleanQuery has long exposed a confusing scoring feature called the coordination factor (coord), to reward hits containing a higher percentage of the search terms. However, this hack is only necessary for scoring models like TF/IDF which have "weak" term saturation such that many occurrences of a single term in a document would be more powerful than adding a single occurence of another term from the query. Since this was specific to one scoring model, TFIDFSimilarity, and since Lucene has now switched to the better Okapi BM25 scoring model by default, we have now fully removed coordination factors in 7.0 from both BooleanQuery and Similarity.

Likewise, the query normalization phase of scoring will be removed. This phase tried to equalize scores across different queries and indices so that they are more comparable, but didn't alter the sort order of hits, and was also TF/IDF specific.

With these scoring simplifications BooleanQuery now makes more aggressive query optimizations when the same sub-clause occurs with different Occur constraints, previously not possible since the scores would change.

Classic Query Parser no longer splits on whitespace

Lucene's original, now called "classic", query parser, always pre-splits the incoming query text on whitespace, and then separately sends those single tokens to the query-time analyzer. This means multi-token filters, such as SynonymGraphFilter or ShingleFilter, will not work.

For example, if the user asks for "denial of service attack" and you had a synonym mapping "denial of service" to DOS, the classic query parser would separately analyze "denial", "of" and "service" so your synonym would never match.

We have already added an option to the query parser to not pre-split on whitespace but left the default unchanged for 6.x releases to preserve backwards compatibility. Finally, with 7.0, we fix that default so that analyzers can see multiple tokens at once, and synonyms will work.

More stuff

As of 7.0 Lucene will (finally!) record into the index metadata which Lucene version was used to originally create it. This knowledge can help us implement future backwards compatibility.

Finite state transducers, used in many ways in Lucene, used to have a complex method call pack which would eek out a few more bytes to further shrink the already small size of the FST. But the code was complex and rarely used and sometimes even made the FST larger so we have removed it for 7.0.

IndexWriter, used to add, update and delete documents in your index, will no longer accept broken token offsets sometimes produced by mis-behaving token filters. Offsets are used for highlighting, and broken offsets, where the end offset for a single token comes before the start offset, or the start offset of a token goes backwards versus the previous token, can only break search-time highlighting. So with this change, Lucene prevents such mistakes at index time by throwing an exception. To ease this transition for cases where users didn't even know their analyzer was producing broken offsets, we've also added a few token filters to "correct" offsets before they are passed to IndexWriter.

Advanced users of Lucene often need to cache something custom for each segment at search time, but the APIs for this are trappy and can lead to unexpected memory leaks so we have overhauled these APIs to reduce the chance of accidental misuse.

Finally, the dimensional points API now takes a field name up front to offer per-field points access, matching how the doc values APIs work.

Lucene 7.0 has not been released, so if you have ideas on any additional major-release-worthy changes you'd like to explore please reach out!

[I work at Amazon and the postings on this site are my own and don't necessarily represent Amazon's position]