Monday, November 17, 2014

Apache Lucene™ 5.0.0 is coming!

At long last, after a strong series of 4.x feature releases, most recently 4.10.2, we are finally working towards another major Apache Lucene release!

There are no promises for the exact timing (it's done when it's done!), but we already have a volunteer release manager (thank you Anshum!).

A major release in Lucene means all APIs deprecated as of 4.10.x are dropped, support for 3.x indices is removed (the numerous 4.x index formats are still supported for index backwards compatibility), and the 4.10.x branch becomes our bug-fix-only release series (no new features, no API changes).

5.0.0 already contains a number of exciting changes, which I describe below, and they are still rolling in with ongoing active development.

Stronger index safety

Many of the 5.0.0 changes are focused on providing stronger protection against index corruption.

All file access now uses Java's NIO.2 APIs, giving us better error handling (e.g., Files.delete returns a meaningful exception) along with atomic rename for safer commits, reducing the risk of hideous "your entire index is gone" bugs like this doozie.
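
For example, here is a minimal sketch (not Lucene's internals; the file names are illustrative) of the NIO.2 calls involved:

import java.nio.file.*;

public class Nio2Demo {
  public static void main(String[] args) throws Exception {
    Path dir = Paths.get("index");

    // Unlike File.delete(), which just returns false on failure,
    // Files.delete throws a meaningful exception (NoSuchFileException, etc.):
    Files.delete(dir.resolve("_0.cfs"));

    // Atomic rename: the new segments_N file becomes visible all at once,
    // so a crash can never expose a half-written commit:
    Files.move(dir.resolve("pending_segments_2"),
               dir.resolve("segments_2"),
               StandardCopyOption.ATOMIC_MOVE);
  }
}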

Lucene's replication module, along with distributed servers on top of Lucene such as Elasticsearch or Solr, must copy index files from one place to another. They do this for backup purposes (e.g., snapshot and restore), for migrating or recovering a shard from one node to another, or when adding a new replica. Such replicators try to be incremental, so that if the same file name is present, with the same length and checksum, it will not be copied again.
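
Conceptually, the skip-or-copy decision looks something like this sketch (not the replication module's actual code):

import java.util.Map;

public class IncrementalCopy {
  // The replica already knows {length, checksum} for each file it has:
  static boolean needsCopy(String name, long length, long checksum,
                           Map<String, long[]> replicaFiles) {
    long[] meta = replicaFiles.get(name);
    // Copy only if the file is missing, or its length/checksum differ:
    return meta == null || meta[0] != length || meta[1] != checksum;
  }
}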

Unfortunately, these layers sometimes have subtle bugs (they are complex!). Thanks to checksums (added in 4.8.0), Lucene already detects if the replicator caused any bit-flips while copying, and this revealed a long-standing nasty bug in the compression library Elasticsearch uses.

With 5.0.0 we take this even further and now detect if whole files were copied to the wrong file name, by assigning a unique id to every segment and commit (segments_N file). Each index file now records the segment id in its header, and then these ids are cross-checked when the index is opened.
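
Conceptually, the cross-check looks like this sketch (not Lucene's actual code, which surfaces the mismatch as a CorruptIndexException):

import java.io.IOException;
import java.util.Arrays;

public class SegmentIdCheck {
  // Each per-segment file records its segment's unique id in its header;
  // on open, that id is cross-checked against the id recorded in the
  // commit point (the segments_N file):
  static void checkId(byte[] expectedId, byte[] idInHeader, String fileName)
      throws IOException {
    if (!Arrays.equals(expectedId, idInHeader)) {
      throw new IOException(
          "file \"" + fileName + "\" belongs to a different segment than expected");
    }
  }
}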

The new Lucene50Codec also includes further index corruption detection.

Even CorruptIndexException itself is improved! It will now always refer to the file or resource where the corruption was detected, as this is now a required argument to its constructors. When corruption is detected higher up (e.g., a bad field number in the field infos file), the resulting CorruptIndexException will now state whether there was also a checksum mismatch in the file, helping to narrow the possible source of the corruption.

Finally, during merge, IndexWriter now always checks the incoming segments for corruption before merging. This can mean, on upgrading to 5.0.0, that merging may uncover long-standing latent corruption in an older 4.x index.

Reduced heap usage

5.0.0 also includes several changes to reduce heap usage during indexing and searching.

If your index has 1B docs, then caching a single FixedBitSet-based filter in 4.10.2 costs a non-trivial 125 MB of heap! But with 5.0.0, Lucene now supports random-writable and advance-able sparse bitsets (RoaringDocIdSet and SparseFixedBitSet), so the heap required is in proportion to how many bits are set, not how many total documents exist in the index. These bitsets also greatly simplify how MultiTermQuery is rewritten (no more CONSTANT_SCORE_AUTO_REWRITE_METHOD), and they provide faster advance implementations than FixedBitSet's linear scan. Finally, they provide a more accurate cost() implementation, allowing Lucene to make better choices about how to drive the intersection at query time.
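
For example, here is a small sketch using SparseFixedBitSet (the doc counts are illustrative):

import org.apache.lucene.util.SparseFixedBitSet;

public class SparseBitsDemo {
  public static void main(String[] args) {
    int maxDoc = 1_000_000_000;                  // 1B docs
    SparseFixedBitSet bits = new SparseFixedBitSet(maxDoc);
    for (int doc = 0; doc < maxDoc; doc += 1_000_000) {
      bits.set(doc);                             // only ~1000 docs match
    }
    // A FixedBitSet of the same length would need maxDoc/8 bytes, ~125 MB;
    // the sparse set's footprint instead tracks the ~1000 set bits:
    System.out.println(bits.ramBytesUsed() + " bytes");
  }
}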

Heap usage during IndexWriter merging is also much lower with the new Lucene50Codec, since doc values and norms for the segments being merged are no longer fully loaded into heap for all fields; now they are loaded for the one field currently being merged, and then dropped.

The default norms format now uses sparse encoding when appropriate, so indices that enable norms for many sparse fields will see a large reduction in required heap at search time.

An explain API for heap usage

If you still find Lucene using more heap than you expected, 5.0.0 has a new API to print a tree structure showing a recursive breakdown of which parts are using how much heap. This is analogous to Lucene's explain API, used to understand why a document has a certain relevance score, but applied to heap usage instead.

It produces output like this:

_cz(5.0.0):C8330469: 28MB 
  postings [...]: 5.2MB 
    ... 
    field 'latitude' [...]: 678.5KB 
      term index [FST(nodes=6679, ...)]: 678.3KB 

This is a much faster way to see what is using up your heap than trying to stare at a Java heap dump.
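
You can produce such a dump yourself; here is a minimal sketch, assuming the 5.0 Accountable/Accountables APIs (the index path is illustrative):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.Accountables;

public class HeapExplain {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
      for (LeafReaderContext ctx : reader.leaves()) {
        // Segment readers implement Accountable, exposing a recursive
        // heap breakdown via getChildResources():
        System.out.println(Accountables.toString((Accountable) ctx.reader()));
      }
    }
  }
}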

Further changes

There is a long tail of additional 5.0.0 changes; here are some of them:

  • Old experimental postings formats (Sep/Fixed/VariableIntPostingsFormat) have been removed. PulsingPostingsFormat has also been removed, since the default postings format already pulses unique terms.

  • FieldCache is gone (moved to a dedicated UninvertingReader in the misc module). This means that when you intend to sort on a field, you should index that field using doc values, which is much faster and consumes far less heap than FieldCache (see the sketch after this list).

  • Tokenizers and Analyzers no longer require Reader on init.

  • NormsFormat now gets its own dedicated NormsConsumer/Producer.

  • Simplifications to FieldInfo (Lucene's "low schema"): no more normType (it is always a DocValuesType.NUMERIC), no more isIndexed (just check IndexOptions).

  • Compound file handling is simpler, and is now under codec control.

  • SortedSetSortField, used to sort on a multi-valued field, is promoted from sandbox to Lucene's core.

  • PostingsFormat now uses a "pull" API when writing postings, just like doc values. This is powerful because you can do things in your postings format that require making more than one pass through the postings, such as iterating over all postings for each term to decide which compression format it should use.

  • Version is no longer required on init to classes like IndexWriterConfig and analysis components.
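
Here is a hedged sketch of the doc-values-based sorting mentioned in the FieldCache item above (the field name, analyzer, and RAMDirectory choices are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class DocValuesSortDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    // Note: no Version argument needed on init anymore (see above):
    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    Document doc = new Document();
    doc.add(new NumericDocValuesField("price", 42L)); // column-stride, sortable
    doc.add(new StringField("id", "1", Field.Store.YES));
    w.addDocument(doc);
    w.close();

    IndexReader r = DirectoryReader.open(dir);
    IndexSearcher s = new IndexSearcher(r);
    // Sorting reads the doc values directly; no FieldCache un-inversion:
    Sort sort = new Sort(new SortField("price", SortField.Type.LONG));
    TopDocs hits = s.search(new MatchAllDocsQuery(), 10, sort);
    System.out.println(hits.totalHits + " hits");
    r.close();
  }
}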

The changes I've described here are just a snapshot of what we have lined up today for a 5.0.0 release. 5.0.0 is still under active development (patches welcome!) so this list will change by the time the actual release is done.

8 comments:

  1. Hi Michael,

    Can you explain the difference between using CategoryPath and using FacetField for faceted search? I have seen many examples where people use either CategoryPath or FacetField.

  2. CategoryPath was used in older releases; newer releases switched to FacetField.

    1. Thanks for the quick reply. But it is not deprecated. I thought that CategoryPath was for tree-like hierarchies and FacetField for flat hierarchies.

    2. As of the 5.0 Lucene release, CategoryPath is replaced with the simpler FacetField, and FacetField does handle hierarchies.

  3. Hi Michael,

    I already sent a mail to the mailing list, but I was not able to get an answer, so I am posting my question here. I am trying to create some APIs using the Lucene facets APIs. First I will explain my requirement with an example. Let's say I am keeping track of the count of people who enter through a certain door, and the time range I am interested in is the last 6 hours (to get the total count, I know that I'll have to use range facets). How do I sample this time range and get the counts for each sample? In other words, as an example, if I split the last 6 hours into 5-minute samples, I get 72 (6*60/5) different time ranges. I would be interested in getting hit counts for each of these 72 ranges in an array, with the respective lower bound of each sample. Can you point me in the direction I should follow / the classes that would be helpful to look at? Elasticsearch already exposes this feature through its JavaScript API. I found a method, SplitLongRange, in the NumericUtils class, but I am not sure how it is used.

    Is it possible to implement the same with Lucene?
    Is there a facets user guide for Lucene 4.10.3 or Lucene 5.0.0?

    1. Hi Gimantha,

      The facets user guide is unfortunately way out of date.

      It looks like Shai did respond to your questions on the list?

      Can you use facet sampling? It does not work with ranges, but it does work with "ordinary" facets, so e.g. if you indexed the bucket ID here (a unique bucket ID for each 5-minute period) then you could sample that?
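
      For the range-facet approach you describe, here is a rough sketch using the facet module's dynamic long ranges (the "timestamp" field name is an assumption; you would index the time as a numeric doc values field):

      import org.apache.lucene.facet.FacetResult;
      import org.apache.lucene.facet.Facets;
      import org.apache.lucene.facet.FacetsCollector;
      import org.apache.lucene.facet.range.LongRange;
      import org.apache.lucene.facet.range.LongRangeFacetCounts;
      import org.apache.lucene.search.IndexSearcher;
      import org.apache.lucene.search.MatchAllDocsQuery;

      public class TimeBuckets {
        // Count hits in 72 five-minute buckets covering the last 6 hours:
        static FacetResult countBuckets(IndexSearcher searcher, long nowMillis)
            throws Exception {
          long bucketMillis = 5 * 60 * 1000L;
          long startMillis = nowMillis - 6 * 60 * 60 * 1000L;
          LongRange[] ranges = new LongRange[72];
          for (int i = 0; i < 72; i++) {
            long lo = startMillis + i * bucketMillis;
            // Label each range with its lower bound, as you asked:
            ranges[i] = new LongRange(String.valueOf(lo), lo, true,
                                      lo + bucketMillis, false);
          }
          FacetsCollector fc = new FacetsCollector();
          FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, fc);
          Facets facets = new LongRangeFacetCounts("timestamp", fc, ranges);
          return facets.getTopChildren(72, "timestamp");
        }
      }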

  4. Hi,
    I want to know how to sort results based on search string match.
    Can you please help me?
