Tuesday, March 14, 2017

Apache Lucene 7.0 Is Coming Soon!

The Apache Lucene project will likely ship its next major release, 7.0, in a few months!

Remember that Lucene developers generally try hard to backport new features to the next non-major (feature) release, and the upcoming 6.5 already has many great changes. A new major release is still exciting because the 7.0-only features, which I describe below, are the particularly big ones that we felt could not be backported for 6.5.

Of course, with every major release, we also do more mundane things like remove deprecated 6.x APIs, and drop support for old indices (written with Lucene 5.x or earlier).

This is only a subset of the new 7.0-only features; for the full list, please see the 7.0.0 section in the upcoming CHANGES.txt.

Doc values as iterators

The biggest change in 7.0 is changing doc values from a random access API to a more restrictive iterator API.

Doc values are Lucene's column-stride numeric, sorted or binary per-document field storage across all documents. They can be used to hold scoring signals, such as the single-byte (by default) document length encoding or application-dependent signals, or for sorting, faceting or grouping, or even numeric fields that you might use for range filtering in some queries. Their column-stride storage means it's efficient to visit all values for the one field across documents, in contrast to row-stride storage that stored fields use to retrieve all field values for a single document.

Postings have long been consumed through an iterator, so this was a relatively natural change to make, and the two share the same base class, DocIdSetIterator, to step through or seek to each hit.

The initial rote switch to an iterator API was really just a plumbing swap, and less interesting than the subsequent user-impacting improvements that became possible thanks to the more restrictive API.

With these changes, you finally only pay for what you actually use with doc values, in index size, indexing performance, etc. This is the same as other parts of the index like postings, stored fields and term vectors, and it means users with very sparse doc values no longer see merges taking an unreasonably long time or the index becoming unexpectedly huge while merging.
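
To make the new access pattern concrete, here is a minimal sketch (the "popularity" field name is made up) of stepping through a numeric doc values field with the iterator API; documents with no value for the field are simply skipped, which is what makes sparse fields cheap:

    import java.io.IOException;

    import org.apache.lucene.index.DocValues;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.search.DocIdSetIterator;

    class PopularityVisitor {
      // Visit every document in one segment that has a value for "popularity".
      static void visitPopularity(LeafReader reader) throws IOException {
        NumericDocValues values = DocValues.getNumeric(reader, "popularity");
        for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) {
          long popularity = values.longValue();  // value for the current document
          // ... use doc and popularity ...
        }
      }
    }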

Our nightly sparse benchmarks, based on the NYC Trip Data corpus, show the impressive gains these changes (and more!) have accomplished.

Goodbye index-time boosts

Index-time boosting, which lets you increase the a-priori score for a particular document versus other documents, is now deprecated and will be removed in 7.0.

This has always been a fragile feature: it was encoded, along with the field's length, into a single byte value, and thus had very low precision. Furthermore, it is now straightforward to write your custom boost into your own doc values field and use function queries to apply the boost at search time. Finally, with index-time boosts gone, length encoding is more accurate; in particular the first nine length values (1 to 9) are now distinct.
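
For example, here is a rough sketch of that approach (the "body" and "boost" field names are made up, and CustomScoreQuery is just one way to apply the function; the exact query classes available vary by version):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.CustomScoreQuery;
    import org.apache.lucene.queries.function.FunctionQuery;
    import org.apache.lucene.queries.function.valuesource.LongFieldSource;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    class DocValuesBoost {
      // Index time: store the per-document boost in its own doc values field.
      static Document makeDocument(String body, long boost) {
        Document doc = new Document();
        doc.add(new TextField("body", body, Field.Store.NO));
        doc.add(new NumericDocValuesField("boost", boost));
        return doc;
      }

      // Search time: multiply the query's score by the stored boost via a function query.
      static Query boostedQuery(String term) {
        Query base = new TermQuery(new Term("body", term));
        return new CustomScoreQuery(base, new FunctionQuery(new LongFieldSource("boost")));
      }
    }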

Query scoring is simpler

BooleanQuery has long exposed a confusing scoring feature called the coordination factor (coord) to reward hits containing a higher percentage of the search terms. However, this hack is only necessary for scoring models like TF/IDF which have "weak" term saturation, such that many occurrences of a single term in a document can count for more than adding a single occurrence of another term from the query. Since this was specific to one scoring model, TFIDFSimilarity, and since Lucene has now switched to the better Okapi BM25 scoring model by default, in 7.0 we have fully removed coordination factors from both BooleanQuery and Similarity.

Likewise, the query normalization phase of scoring will be removed. This phase tried to equalize scores across different queries and indices so that they are more comparable, but didn't alter the sort order of hits, and was also TF/IDF specific.

With these scoring simplifications, BooleanQuery can now make more aggressive query optimizations when the same sub-clause occurs with different Occur constraints, which was previously not possible since it would have changed scores.

Classic Query Parser no longer splits on whitespace

Lucene's original query parser, now called "classic", always pre-splits the incoming query text on whitespace, and then separately sends those single tokens to the query-time analyzer. This means multi-token filters, such as SynonymGraphFilter or ShingleFilter, will not work.

For example, if the user asks for "denial of service attack" and you have a synonym mapping "denial of service" to DOS, the classic query parser would separately analyze "denial", "of" and "service" so your synonym would never match.

We have already added an option to the query parser to not pre-split on whitespace but left the default unchanged for 6.x releases to preserve backwards compatibility. Finally, with 7.0, we fix that default so that analyzers can see multiple tokens at once, and synonyms will work.
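
Here is a small sketch of how that fits together (the "body" field name and the synonym mapping are made up): an analyzer with SynonymGraphFilter, plus the classic query parser with whitespace splitting turned off so the filter can see the whole phrase:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.CharsRefBuilder;

    public class MultiTokenSynonymQueryExample {
      public static void main(String[] args) throws Exception {
        // Map the multi-token phrase "denial of service" to the single token "dos".
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(SynonymMap.Builder.join(new String[] {"denial", "of", "service"}, new CharsRefBuilder()),
                    SynonymMap.Builder.join(new String[] {"dos"}, new CharsRefBuilder()),
                    true);
        SynonymMap synonyms = builder.build();

        Analyzer analyzer = new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new WhitespaceTokenizer();
            TokenStream stream = new SynonymGraphFilter(tokenizer, synonyms, true);
            return new TokenStreamComponents(tokenizer, stream);
          }
        };

        QueryParser parser = new QueryParser("body", analyzer);
        // On 6.x this defaults to true; turn it off so the analyzer sees the whole phrase.
        parser.setSplitOnWhitespace(false);
        Query query = parser.parse("denial of service attack");
        System.out.println(query);
      }
    }

In 7.0 that last setting becomes the default, so it is only needed on the 6.x releases.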

More stuff

As of 7.0 Lucene will (finally!) record into the index metadata which Lucene version was used to originally create it. This knowledge can help us implement future backwards compatibility.

Finite state transducers, used in many ways in Lucene, used to have a complex method call, pack, which would eke out a few more bytes to further shrink the already small size of the FST. But the code was complex, rarely used and sometimes even made the FST larger, so we have removed it for 7.0.

IndexWriter, used to add, update and delete documents in your index, will no longer accept broken token offsets, sometimes produced by misbehaving token filters. Offsets are used for highlighting, and broken offsets, where the end offset for a single token comes before the start offset, or the start offset of a token goes backwards versus the previous token, can only break search-time highlighting. So with this change, Lucene prevents such mistakes at index time by throwing an exception. To ease this transition for cases where users didn't even know their analyzer was producing broken offsets, we have also added a few token filters to "correct" offsets before they are passed to IndexWriter.

Advanced users of Lucene often need to cache something custom for each segment at search time, but the APIs for this are trappy and can lead to unexpected memory leaks so we have overhauled these APIs to reduce the chance of accidental misuse.

Finally, the dimensional points API now takes a field name up front to offer per-field points access, matching how the doc values APIs work.
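
As a tiny sketch (the "price" field name is made up), per-segment points access in 7.0 looks like this:

    import java.io.IOException;

    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.PointValues;

    class PointsInspector {
      static void inspect(LeafReader reader) throws IOException {
        // The field name is now passed up front, matching the doc values APIs:
        PointValues points = reader.getPointValues("price");  // null if the field indexed no points
        if (points != null) {
          long count = points.size();               // number of indexed point values
          byte[] min = points.getMinPackedValue();  // per-dimension minimum, packed bytes
          byte[] max = points.getMaxPackedValue();  // per-dimension maximum, packed bytes
        }
      }
    }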

Lucene 7.0 has not been released, so if you have ideas on any additional major-release-worthy changes you'd like to explore please reach out!

[I work at Amazon and the postings on this site are my own and don't necessarily represent Amazon's position]

13 comments:

  1. Thanks for the amazing post, as always! I was curious when Lucene 6.5 will be released; it has many awesome fixes and additions, like the ones you did for *GraphFilter?

    1. Hello Jigar,

      Thank you!

      Actually, the 6.5 release process will likely start tomorrow with a new branch: http://lucene.markmail.org/thread/gskdscum7rvofdua

    2. Hello Michael. I want to index concepts within text documents with Lucene, but I don't know how to do it. Can you please give me some ideas?

    3. Hi Adama,

      Maybe send an email to the Lucene user's list (java-user@lucene.apache.org) with more details? You need to either add these concepts yourself, e.g. in a new field on the document, or create your own TokenFilter which detects concepts during analysis and then inserts them into the stream of tokens, much like SynonymFilter.

  2. Hello Michael,

    My application needs to get the total number of indexed terms for browsing the dictionary of terms. This is done by reading the index and TermsEnum.

    This is acceptable for small and medium index files but unacceptable for large index files. Anything new on this issue for 6.5 and 7.0 releases ?

    Thank you in advance,
    Jean-Claude

  3. Hi JCD, if you really just want the total number of terms in a given field, you can use the Terms.size method? We also track other summary corpus level statistics, e.g. the sum of total term frequency across all terms, sum of the doc freq across all terms, etc.

  4. Thank you Michael for your suggestion. The problem is that the Terms.size method returns -1 because the Terms are in fact MultiTerms. The only solution seems to be to iterate through all terms and count them.

    // Count all terms across every field of one leaf (segment) reader:
    long termsCount = 0;
    Fields fields = leafReader.fields();
    for (String field : fields) {
      Terms terms = fields.terms(field);
      if (terms == null) {
        continue;
      }
      TermsEnum termsEnum = terms.iterator();
      BytesRef text;
      while ((text = termsEnum.next()) != null) {
        termsCount++;
      }
    }

    1. Ahh I see, yes, there is no better way than counting to get an exact count.

      Perhaps a probabilistic solution could work, where you estimate the percentage of terms that are in common across the large segments and project out from that, but I don't know of any specific efforts to do that with Lucene.

      You could also IW.forceMerge(1) your index down to a single segment, and then the count is correct, but that operation has a massive cost.

  5. Thanks Michael for your response.

    I finally wrote a class to parallelize the term count per field, but I am not sure it's a good idea. It seems faster, but it's difficult to measure the real improvement.

    // Counts the terms of one TermsEnum on its own thread.
    static class CountTermsRun implements Callable<Long> {

      private final TermsEnum termsEnum_;

      CountTermsRun(TermsEnum termsEnum) {
        termsEnum_ = termsEnum;
      }

      @Override
      public Long call() {
        long termsCount = 0;
        try {
          BytesRef text;
          while ((text = termsEnum_.next()) != null) {
            termsCount++;
          }
        } catch (IOException ex) {
          Logger.getLogger(CountTermsRun.class.getName()).log(Level.SEVERE, null, ex);
        }
        return termsCount;
      }
    }

    // Submits one counting task per field and sums the per-field totals.
    public static long getDictionaryTermCount(Fields fields) throws Exception {

      long termsCount = 0;
      ExecutorService executor = Executors.newFixedThreadPool(2);

      List<CountTermsRun> list = new ArrayList<>();
      for (String field : fields) {
        Terms terms = fields.terms(field);
        if (terms == null) {
          continue;
        }
        TermsEnum termsEnum = terms.iterator();
        list.add(new CountTermsRun(termsEnum));
      }

      List<Future<Long>> results = executor.invokeAll(list);
      executor.shutdown();

      for (Future<Long> result : results) {
        termsCount += result.get();
      }
      return termsCount;
    }

  6. Hi Mike,

    Thanks for the post. Perhaps there is a better/different way of achieving what I want, but I'd love to see the ability to store filter results in the Lucene index (and subsequently have it exposed in Elasticsearch). I have a source control search application to which I am adding the ability to search particular commits in history. Rather than indexing the full set of files at each commit point, I would like to only index the added/changed files. Then I would create a filter for the commit which would be stored in the index (on a per-segment basis). When a segment with stored filters is merged, the filters would be merged into the merged segment's postings. This functionality could also be used to 'tag' documents if the tag would only be used for filtering purposes. I am currently working on prototyping this in Elasticsearch for my use, but maybe it is applicable to others as well.

    Thanks again,
    Lance

    1. You could e.g. index a multi-valued StringField per document (source file), where each value is a commit revision where that source file was changed. Then it would be very fast to filter for all sources changed on a given commit?
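
      A rough sketch of that idea (the field names and the fileText/revisionsWhereChanged variables are made up):

      // One Lucene document per source file; each commit revision in which the file
      // changed becomes another value of a multi-valued StringField.
      Document doc = new Document();
      doc.add(new TextField("contents", fileText, Field.Store.NO));
      for (String revision : revisionsWhereChanged) {
        doc.add(new StringField("commit", revision, Field.Store.NO));
      }
      // Everything changed in a given commit is then a single TermQuery filter:
      Query changedInCommit = new TermQuery(new Term("commit", "abc123"));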

  7. Thanks Mike. To clarify, I need to search all sources present in the repository at a given commit. For instance, if I have 1000 files in my repository at a particular commit containing 2 changes, I want to be able to search the set of 1000 files not just the 2 changed files. Right now, I can accomplish this with a very large terms filter which specifies the unique id for every document present at that commit, but that is not tractable for very large codebases.

    1. OK I see. And I assume when a source file is changed, you index a new Lucene document? I.e., a given source file will have N documents in the Lucene index where N is the number of revisions of that source file over time.

      And then to search a given commit, you need to know the current revision for all sources that were in source control as of that commit.

      This seems like a fun problem; I'd have to think more about it.

      Maybe try emailing the Lucene users list (java-user@lucene.apache.org)?
