Tuesday, March 14, 2017

Apache Lucene 7.0 Is Coming Soon!

The Apache Lucene project will likely ship its next major release, 7.0, in a few months!

Remember that Lucene developers generally try hard to backport new features to the next non-major (feature) release, and the upcoming 6.5 already has many great changes. A new major release is still exciting because the 7.0-only features, which I describe below, are the particularly big ones that we felt could not be backported for 6.5.

Of course, with every major release, we also do more mundane things like remove deprecated 6.x APIs, and drop support for old indices (written with Lucene 5.x or earlier).

This is only a subset of the new 7.0-only features; for the full list, please see the 7.0.0 section in the upcoming CHANGES.txt.

Doc values as iterators

The biggest change in 7.0 is changing doc values from a random access API to a more restrictive iterator API.

Doc values are Lucene's column-stride numeric, sorted or binary per-document field storage across all documents. They can be used to hold scoring signals, such as the single-byte (by default) document length encoding or application-dependent signals, or for sorting, faceting or grouping, or even numeric fields that you might use for range filtering in some queries. Their column-stride storage means it's efficient to visit all values for the one field across documents, in contrast to row-stride storage that stored fields use to retrieve all field values for a single document.

Postings have long been consumed through an iterator, so this was a relatively natural change to make, and the two share the same base class, DocIdSetIterator, to step through or seek to each hit.

The initial rote switch to an iterator API was really just a plumbing swap, and less interesting than the subsequent user-impacting improvements that became possible thanks to the more restrictive API.

With these changes, you finally only pay for what you actually use with doc values, in index size, indexing performance, etc. This is the same as other parts of the index like postings, stored fields and term vectors, and it means users with very sparse doc values no longer see merges taking an unreasonably long time or the index becoming unexpectedly huge while merging.
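
To make the new access pattern concrete, here is a minimal sketch (the "popularity" field name is made up) of stepping through a numeric doc values field with the iterator API; documents with no value for the field are simply skipped, which is what makes sparse fields cheap:

    import java.io.IOException;

    import org.apache.lucene.index.DocValues;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.search.DocIdSetIterator;

    class PopularityVisitor {
      // Visit every document in one segment that has a value for "popularity".
      static void visitPopularity(LeafReader reader) throws IOException {
        NumericDocValues values = DocValues.getNumeric(reader, "popularity");
        for (int doc = values.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = values.nextDoc()) {
          long popularity = values.longValue();  // value for the current document
          // ... use doc and popularity ...
        }
      }
    }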

Our nightly sparse benchmarks, based on the NYC Trip Data corpus, show the impressive gains these changes (and more!) have accomplished.

Goodbye index-time boosts

Index-time boosting, which lets you increase the a-priori score for a particular document versus other documents, is now deprecated and will be removed in 7.0.

This has always been a fragile feature: it was encoded, along with the field's length, into a single byte value, and thus had very low precision. Furthermore, it is now straightforward to write your custom boost into your own doc values field and use function queries to apply the boost at search time. Finally, with index-time boosts gone, length encoding is more accurate; in particular the first nine length values (1 to 9) are now distinct.
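
For example, here is a rough sketch of that approach (the "body" and "boost" field names are made up, and CustomScoreQuery is just one way to apply the function; the exact query classes available vary by version):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.CustomScoreQuery;
    import org.apache.lucene.queries.function.FunctionQuery;
    import org.apache.lucene.queries.function.valuesource.LongFieldSource;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    class DocValuesBoost {
      // Index time: store the per-document boost in its own doc values field.
      static Document makeDocument(String body, long boost) {
        Document doc = new Document();
        doc.add(new TextField("body", body, Field.Store.NO));
        doc.add(new NumericDocValuesField("boost", boost));
        return doc;
      }

      // Search time: multiply the query's score by the stored boost via a function query.
      static Query boostedQuery(String term) {
        Query base = new TermQuery(new Term("body", term));
        return new CustomScoreQuery(base, new FunctionQuery(new LongFieldSource("boost")));
      }
    }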

Query scoring is simpler

BooleanQuery has long exposed a confusing scoring feature called the coordination factor (coord) to reward hits containing a higher percentage of the search terms. However, this hack is only necessary for scoring models like TF/IDF which have "weak" term saturation, such that many occurrences of a single term in a document can count for more than adding a single occurrence of another term from the query. Since this was specific to one scoring model, TFIDFSimilarity, and since Lucene has now switched to the better Okapi BM25 scoring model by default, in 7.0 we have fully removed coordination factors from both BooleanQuery and Similarity.

Likewise, the query normalization phase of scoring will be removed. This phase tried to equalize scores across different queries and indices so that they are more comparable, but didn't alter the sort order of hits, and was also TF/IDF specific.

With these scoring simplifications, BooleanQuery can now make more aggressive query optimizations when the same sub-clause occurs with different Occur constraints, which was previously not possible since it would have changed scores.

Classic Query Parser no longer splits on whitespace

Lucene's original query parser, now called "classic", always pre-splits the incoming query text on whitespace, and then separately sends those single tokens to the query-time analyzer. This means multi-token filters, such as SynonymGraphFilter or ShingleFilter, will not work.

For example, if the user asks for "denial of service attack" and you have a synonym mapping "denial of service" to DOS, the classic query parser would separately analyze "denial", "of" and "service" so your synonym would never match.

We have already added an option to the query parser to not pre-split on whitespace but left the default unchanged for 6.x releases to preserve backwards compatibility. Finally, with 7.0, we fix that default so that analyzers can see multiple tokens at once, and synonyms will work.
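
Here is a small sketch of how that fits together (the "body" field name and the synonym mapping are made up): an analyzer with SynonymGraphFilter, plus the classic query parser with whitespace splitting turned off so the filter can see the whole phrase:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.CharsRefBuilder;

    public class MultiTokenSynonymQueryExample {
      public static void main(String[] args) throws Exception {
        // Map the multi-token phrase "denial of service" to the single token "dos".
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        builder.add(SynonymMap.Builder.join(new String[] {"denial", "of", "service"}, new CharsRefBuilder()),
                    SynonymMap.Builder.join(new String[] {"dos"}, new CharsRefBuilder()),
                    true);
        SynonymMap synonyms = builder.build();

        Analyzer analyzer = new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new WhitespaceTokenizer();
            TokenStream stream = new SynonymGraphFilter(tokenizer, synonyms, true);
            return new TokenStreamComponents(tokenizer, stream);
          }
        };

        QueryParser parser = new QueryParser("body", analyzer);
        // On 6.x this defaults to true; turn it off so the analyzer sees the whole phrase.
        parser.setSplitOnWhitespace(false);
        Query query = parser.parse("denial of service attack");
        System.out.println(query);
      }
    }

In 7.0 that last setting becomes the default, so it is only needed on the 6.x releases.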

More stuff

As of 7.0 Lucene will (finally!) record into the index metadata which Lucene version was used to originally create it. This knowledge can help us implement future backwards compatibility.

Finite state transducers, used in many ways in Lucene, used to have a complex method call, pack, which would eke out a few more bytes to further shrink the already small size of the FST. But the code was complex, rarely used and sometimes even made the FST larger, so we have removed it for 7.0.

IndexWriter, used to add, update and delete documents in your index, will no longer accept broken token offsets, sometimes produced by misbehaving token filters. Offsets are used for highlighting, and broken offsets, where the end offset for a single token comes before the start offset, or the start offset of a token goes backwards versus the previous token, can only break search-time highlighting. So with this change, Lucene prevents such mistakes at index time by throwing an exception. To ease this transition for cases where users didn't even know their analyzer was producing broken offsets, we have also added a few token filters to "correct" offsets before they are passed to IndexWriter.

Advanced users of Lucene often need to cache something custom for each segment at search time, but the APIs for this are trappy and can lead to unexpected memory leaks so we have overhauled these APIs to reduce the chance of accidental misuse.

Finally, the dimensional points API now takes a field name up front to offer per-field points access, matching how the doc values APIs work.
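
As a tiny sketch (the "price" field name is made up), per-segment points access in 7.0 looks like this:

    import java.io.IOException;

    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.PointValues;

    class PointsInspector {
      static void inspect(LeafReader reader) throws IOException {
        // The field name is now passed up front, matching the doc values APIs:
        PointValues points = reader.getPointValues("price");  // null if the field indexed no points
        if (points != null) {
          long count = points.size();               // number of indexed point values
          byte[] min = points.getMinPackedValue();  // per-dimension minimum, packed bytes
          byte[] max = points.getMaxPackedValue();  // per-dimension maximum, packed bytes
        }
      }
    }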

Lucene 7.0 has not been released, so if you have ideas on any additional major-release-worthy changes you'd like to explore please reach out!

[I work at Amazon and the postings on this site are my own and don't necessarily represent Amazon's position]

13 comments:

  1. Thanks for the amazing post, as always! I was curious when Lucene 6.5 will be released; it has many awesome fixes and additions, like the ones you did for *GraphFilter?

    1. Hello Jigar,

      Thank you!

      Actually, the 6.5 release process will likely start tomorrow with a new branch: http://lucene.markmail.org/thread/gskdscum7rvofdua

    2. Hello Michael. I want to index concepts within text documents with Lucene, but I don't know how to do it. Can you please give me some ideas?

    3. Hi Adama,

      Maybe send an email to the Lucene user's list (java-user@lucene.apache.org) with more details? You need to either add these concepts yourself, e.g. in a new field on the document, or create your own TokenFilter which detects concepts during analysis and then inserts them into the stream of tokens, much like SynonymFilter.

  2. Hello Michael,

    My application needs to get the total number of indexed terms for browsing the dictionary of terms. This is done by reading the index and TermsEnum.

    This is acceptable for small and medium index files but unacceptable for large index files. Anything new on this issue for 6.5 and 7.0 releases ?

    Thank you in advance,
    Jean-Claude

  3. Hi JCD, if you really just want the total number of terms in a given field, you can use the Terms.size method? We also track other summary corpus level statistics, e.g. the sum of total term frequency across all terms, sum of the doc freq across all terms, etc.

  4. Thank you Michael for your suggestion. The problem is that the Terms.size method returns -1 because the Terms are in fact MultiTerms. The only solution seems to be to iterate through all terms and count them.

    // Count all terms across every field of one leaf (segment) reader:
    long termsCount = 0;
    Fields fields = leafReader.fields();
    for (String field : fields) {
      Terms terms = fields.terms(field);
      if (terms == null) {
        continue;
      }
      TermsEnum termsEnum = terms.iterator();
      BytesRef text;
      while ((text = termsEnum.next()) != null) {
        termsCount++;
      }
    }

    1. Ahh I see, yes, there is no better way than counting to get an exact count.

      Perhaps a probabilistic solution could work, where you estimate the percentage of terms that are in common across the large segments and project out from that, but I don't know of any specific efforts to do that with Lucene.

      You could also IW.forceMerge(1) your index down to a single segment, and then the count is correct, but that operation has a massive cost.

  5. Thanks Michael for your response.

    I finally wrote a class to parallelize the term count per field, but I am not sure it's a good idea. It seems faster, but it's difficult to measure the real improvement.

    // Counts the terms of one TermsEnum on its own thread.
    static class CountTermsRun implements Callable<Long> {

      private final TermsEnum termsEnum_;

      CountTermsRun(TermsEnum termsEnum) {
        termsEnum_ = termsEnum;
      }

      @Override
      public Long call() {
        long termsCount = 0;
        try {
          BytesRef text;
          while ((text = termsEnum_.next()) != null) {
            termsCount++;
          }
        } catch (IOException ex) {
          Logger.getLogger(CountTermsRun.class.getName()).log(Level.SEVERE, null, ex);
        }
        return termsCount;
      }
    }

    // Submits one counting task per field and sums the per-field totals.
    public static long getDictionaryTermCount(Fields fields) throws Exception {

      long termsCount = 0;
      ExecutorService executor = Executors.newFixedThreadPool(2);

      List<CountTermsRun> list = new ArrayList<>();
      for (String field : fields) {
        Terms terms = fields.terms(field);
        if (terms == null) {
          continue;
        }
        TermsEnum termsEnum = terms.iterator();
        list.add(new CountTermsRun(termsEnum));
      }

      List<Future<Long>> results = executor.invokeAll(list);
      executor.shutdown();

      for (Future<Long> result : results) {
        termsCount += result.get();
      }
      return termsCount;
    }

  6. Hi Mike,

    Thanks for the post. Perhaps there is a better/different way of achieving what I want, but I'd love to see the ability to store filter results in the Lucene index (and subsequently have it exposed in Elasticsearch). I have a source control search application to which I am adding the ability to search particular commits in history. Rather than indexing the full set of files at each commit point, I would like to only index the added/changed files. Then I would create a filter for the commit which would be stored in the index (on a per-segment basis). When a segment with stored filters is merged, the filters would be merged into the merged segment's postings. This functionality could also be used to 'tag' documents if the tag would only be used for filtering purposes. I am currently working on prototyping this in Elasticsearch for my use, but maybe it is applicable to others as well.

    Thanks again,
    Lance

    1. You could e.g. index a multi-valued StringField per document (source file), where each value is a commit revision where that source file was changed. Then it would be very fast to filter for all sources changed on a given commit?
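
      A rough sketch of that idea (the field names and the fileText/revisionsWhereChanged variables are made up):

      // One Lucene document per source file; each commit revision in which the file
      // changed becomes another value of a multi-valued StringField.
      Document doc = new Document();
      doc.add(new TextField("contents", fileText, Field.Store.NO));
      for (String revision : revisionsWhereChanged) {
        doc.add(new StringField("commit", revision, Field.Store.NO));
      }
      // Everything changed in a given commit is then a single TermQuery filter:
      Query changedInCommit = new TermQuery(new Term("commit", "abc123"));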

  7. Thanks Mike. To clarify, I need to search all sources present in the repository at a given commit. For instance, if I have 1000 files in my repository at a particular commit containing 2 changes, I want to be able to search the set of 1000 files not just the 2 changed files. Right now, I can accomplish this with a very large terms filter which specifies the unique id for every document present at that commit, but that is not tractable for very large codebases.

    1. OK I see. And I assume when a source file is changed, you index a new Lucene document? I.e., a given source file will have N documents in the Lucene index where N is the number of revisions of that source file over time.

      And then to search a given commit, you need to know the current revision for all sources that were in source control as of that commit.

      This seems like a fun problem; I'd have to think more about it.

      Maybe try emailing the Lucene users list (java-user@lucene.apache.org)?
