Tuesday, July 3, 2012

Lucene 4.0.0 alpha, at long last!

The 4.0.0 alpha release of Lucene and Solr is finally out!

This is a major release with lots of great changes. Here I briefly describe the most important Lucene changes, but first the basics:
  • All deprecated APIs as of 3.6.0 have been removed.

  • Pre-3.0 indices are no longer supported.

  • MIGRATE.txt describes how to update your application code.

  • The index format won't change (unless a serious bug fix requires it) between this release and 4.0 GA, but APIs may still change before 4.0.0 beta.

Please try the release and report back!

Pluggable Codec

The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis.

There are some fun core codecs:
  • Lucene40 is the default codec.

  • Lucene3x (read-only) reads any index written with Lucene 3.x.

  • SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!).

  • MemoryPostingsFormat stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (primary key (id) field, date field, etc.).

  • PulsingPostingsFormat inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup.

  • AppendingCodec avoids seeking while writing, which is necessary for append-only file systems such as Hadoop DFS.
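
For example, here is a minimal sketch of choosing a postings format per field; the override hook on Lucene40Codec, the "Pulsing40" format name, and the surrounding setup (dir, analyzer) are assumptions based on the 4.0 alpha APIs:

    // Use the default Lucene40 codec, but store the "id" field with the Pulsing postings format.
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    config.setCodec(new Lucene40Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        if ("id".equals(field)) {
          return PostingsFormat.forName("Pulsing40");   // inline low-frequency terms
        }
        return super.getPostingsFormatForField(field);  // everything else stays on the default
      }
    });
    IndexWriter writer = new IndexWriter(dir, config);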

If you create your own Codec it's easy to confirm all of Lucene/Solr's tests pass with it. If tests fail then likely your Codec has a bug!

A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API.
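
As a rough sketch, enumerating that chain looks something like this (4.0 alpha signatures assumed; "reader" is an already-open IndexReader):

    // Walk terms -> documents for the "body" field.
    Terms terms = MultiFields.getTerms(reader, "body");
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
      DocsEnum docs = termsEnum.docs(MultiFields.getLiveDocs(reader), null);
      int docID;
      while ((docID = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        // process (term, docID); ask for a DocsAndPositionsEnum instead to visit positions
      }
    }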

Flexible scoring

Lucene's scoring is now fully pluggable, with the TF/IDF vector space model remaining as the default. You can create your own scoring model, or use one of the core scoring models (BM25, Divergence from Randomness, Language Models, and Information-based models). Per-document normalization values are no longer limited to a single byte. Various new aggregate statistics are now available.
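
For instance, switching to BM25 is a one-liner at both index and search time; a minimal sketch, assuming the 4.0 similarities package and an existing analyzer/reader:

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    config.setSimilarity(new BM25Similarity());      // instead of the default TF/IDF similarity

    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(new BM25Similarity());    // use the same similarity at search time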

These changes were part of a 2011 Google Summer of Code project (thank you David!).

These two changes are really important because they remove the barriers to ongoing innovations. Now it's easy to experiment with wild changes to the index format or to Lucene's scoring models. A recent example of such innovation is this neat codec by the devs at Flax to enable updatable fields by storing postings in a Redis key/value store.

Document Values

The new document values API stores strongly typed single-valued fields per document, meant as an eventual replacement for Lucene's field cache. The values are pre-computed during indexing and stored in the index in a column-stride format (values for a single field across all documents are stored together), making it much faster to initialize at search time than the field cache. Values can be fixed 8, 16, 32, 64 bit ints, or variable-bits sized (packed) ints; float or double; and six flavors of byte[] (fixed size or variable sized; dereferenced, straight or sorted).
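
A quick hypothetical sketch of indexing doc values (the per-type field classes such as PackedLongDocValuesField were still settling in the alpha, so exact names may differ; "writer" is an open IndexWriter):

    Document doc = new Document();
    doc.add(new StringField("id", "doc-42", Field.Store.YES));
    doc.add(new PackedLongDocValuesField("timestamp", 1341300000L));  // column-stride long
    doc.add(new FloatDocValuesField("popularity", 1.5f));             // column-stride float
    writer.addDocument(doc);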

New Field APIs

The API for creating document fields has changed: Fieldable and AbstractField have been removed, and a new FieldType, factored out of Field class, holds details about how the field's value should be indexed. New classes have been created for specific commonly-used fields:
  • StringField indexes a string as a single token, without norms and as docs only. For example, use this for a primary key (id) field, or for a field you will sort on.

  • TextField indexes the fully tokenized string, with norms and including docs, term frequencies and positions. For example, use this for the primary text field.

  • StoredField is a field whose value is just stored.

  • The XXXDocValuesField classes create typed document values fields.

  • IntField, FloatField, LongField and DoubleField create typed numeric fields for efficient range queries and filters.

If none of these field classes apply, you can always create your own FieldType (typically by starting from the exposed FieldTypes from the above classes and then tweaking), and then construct a Field by passing the name, FieldType and value.
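
Putting it together, here is a hedged sketch of the new field classes and a tweaked FieldType (constructor and constant names assumed from the 4.0 alpha; "writer" is an open IndexWriter):

    Document doc = new Document();
    doc.add(new StringField("id", "doc-42", Field.Store.YES));              // single token, docs only
    doc.add(new TextField("body", "the quick brown fox", Field.Store.NO));  // fully tokenized
    doc.add(new StoredField("thumbnail", new byte[] {1, 2, 3}));            // stored only
    doc.add(new IntField("price", 1299, Field.Store.NO));                   // numeric, range-searchable

    // Custom FieldType: start from an exposed type and tweak it.
    FieldType withVectors = new FieldType(TextField.TYPE_NOT_STORED);
    withVectors.setStoreTermVectors(true);
    withVectors.freeze();
    doc.add(new Field("bodyWithVectors", "the quick brown fox", withVectors));
    writer.addDocument(doc);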

Note that the old APIs (using Index, Store, TermVector enums) are still present (deprecated), to ease migration.

These changes were part of a 2011 Google Summer of Code project (thank you Nikola!).

Other big changes

Lucene's terms are now binary (arbitrary byte[]); by default they are UTF-8 encoded strings, sorted in Unicode sort order. But your Analyzer is free to produce tokens with an arbitrary byte[] (e.g., CollationKeyAnalyzer does so).

A new DirectSpellChecker finds suggestions directly from any Lucene index, avoiding the hassle of maintaining a sidecar spellchecker index. It uses the same fast Levenshtein automata as FuzzyQuery (see below).
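
Using it takes only a few lines; a sketch assuming the 4.0 suggest module and an open IndexReader:

    DirectSpellChecker spellChecker = new DirectSpellChecker();
    SuggestWord[] suggestions = spellChecker.suggestSimilar(
        new Term("body", "lucnee"), 5, reader);          // up to 5 corrections from the "body" field
    for (SuggestWord suggestion : suggestions) {
      System.out.println(suggestion.string + " freq=" + suggestion.freq);
    }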

Term offsets (the start and end character position of each term) may now be stored in the postings, by using FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS when indexing the field. I expect this will be useful for fast highlighting without requiring term vectors, but this part is not yet done (patches welcome!).
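
If you want to try it, the knob lives on FieldType; a sketch with names assumed from the 4.0 alpha:

    FieldType offsetsType = new FieldType(TextField.TYPE_NOT_STORED);
    offsetsType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    offsetsType.freeze();

    Document doc = new Document();
    doc.add(new Field("body", "the quick brown fox", offsetsType));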

A new AutomatonQuery matches all documents containing any term matching a provided automaton. Both WildcardQuery and RegexpQuery simply construct the corresponding automaton and then run AutomatonQuery. The classic QueryParser produces a RegexpQuery if you type fieldName:/expression/, or /expression/ to search the default field.
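
A quick sketch of these forms (4.0 alpha APIs assumed; "analyzer" is any Analyzer):

    Query wildcard = new WildcardQuery(new Term("body", "lu*ne"));   // compiled to an automaton
    Query regexp   = new RegexpQuery(new Term("body", "lu.*ne"));    // also an automaton underneath

    QueryParser parser = new QueryParser(Version.LUCENE_40, "body", analyzer);
    Query parsed = parser.parse("body:/lu.*ne/");                    // produces a RegexpQuery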

Optimizations

Beyond the fun new features there are some incredible performance gains.

If you use FuzzyQuery, you should see a factor of 100-200 speedup on moderately sized indices.

If you search with a Filter, you may see speedups of up to 3X (depending on filter density and query complexity), thanks to a change that applies filters just like we apply deleted documents.

If you use multiple threads for indexing, you should see stunning throughput gains (265% in one test), thanks to concurrent flushing. You can also now use an IndexWriter RAM buffer larger than 2048 MB (as long as you use multiple threads).

The new BlockTree default terms dictionary uses far less RAM to hold the terms index, and can sometimes avoid going to disk for terms that do not exist. In addition, the field cache also uses substantially less RAM, by avoiding separate objects per document and instead packing character data into shared byte[] blocks. Together this results in a 73% reduction in RAM required for searching in one case.

IndexWriter now buffers term data using byte[] instead of char[], using half the RAM for ASCII terms.

MultiTermQuery now rewrites per-segment, and caches per-term metadata to avoid a second lookup during scoring. This should improve performance though it hasn't been directly tested.

If a BooleanQuery consists of only MUST TermQuery clauses, then a specialized ConjunctionTermScorer is used, giving ~25% speedup.

Reducing merge IO impact

Merging (consolidating many small segments into a single big one) is a very IO and CPU intensive operation which can easily interfere with ongoing searches. In 4.0.0 we now have two ways to reduce this impact:
  • Rate-limit the IO caused by ongoing merging, by calling FSDirectory.setMaxMergeWriteMBPerSec (see the sketch after this list).

  • Use the new NativeUnixDirectory which bypasses the OS's IO cache for all merge IO, by using direct IO. This ensures that a merge won't evict hot pages used by searches. (Note that there is also a native WindowsDirectory, but it does not yet use direct IO during merging... patches welcome!).
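
A minimal sketch of the rate-limiting option (the method name comes from this release, but its exact signature is an assumption, and the path/analyzer are placeholders):

    FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
    dir.setMaxMergeWriteMBPerSec(10.0);   // cap merge writes at roughly 10 MB/sec
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_40, analyzer));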

Remember to also set swappiness to 0 on Linux if you want to maximize search responsiveness.

More generally, the APIs that open an input or output file (Directory.openInput and Directory.createOutput) now take an IOContext describing what's being done (e.g., flush vs merge), so you can create a custom Directory that changes its behavior depending on the context.

These changes were part of a 2011 Google Summer of Code project (thank you Varun!).

Consolidated modules

The diverse sources, previously scattered between Lucene's and Solr's core and contrib, have been consolidated. Especially noteworthy is the analysis module, providing a rich selection of 48 analyzers across many languages; the queries module, containing function queries (the old core function queries have been removed) and other non-core query classes; and the queryparser module, with numerous query parsers including the classic QueryParser (moved from core).

Other changes

Here's a long list of additional changes:
  • The classic QueryParser now interprets term~N where N is an integer >= 1 as a FuzzyQuery with edit distance N.

  • The field cache normally requires single-valued fields, but we've added FieldCache.getDocTermsOrd which can handle multi-valued fields.

  • Analyzers must always provide a reusable token stream, by implementing the Analyzer.createComponents method (reusableTokenStream has been removed and tokenStream is now final in Analyzer); see the sketch at the end of this list.

  • IndexReaders are now read-only (you cannot delete documents by ID, nor change norms) and are strongly typed as AtomicIndexReader or CompositeIndexReader.

  • The API for reading term vectors is the same API used for reading all postings, except the term vector API only covers a single document. This is a good match because term vectors are really just a single-document inverted index.

  • Positional queries (PhraseQuery, SpanQuery) will now throw an exception if you run them against a field that did not index positions (previously they silently returned 0 hits).

  • String-based field cache APIs have been replaced with BytesRef based APIs.

  • ParallelMultiSearcher has been absorbed into IndexSearcher as an optional ExecutorService argument to the constructor. Searcher and Searchable have been removed.

  • All serialization code has been removed from Lucene's classes; you must handle serialization at a higher level in your application.

  • Field names are no longer interned, so you cannot rely on == to test for equality (use .equals instead).

  • The *SpanFilter classes have been removed: they created too many objects during searching and did not scale.

  • Removed IndexSearcher.close: IndexSearcher now only takes a provided IndexReader (no longer a Directory), which is the caller's responsibility to close.

  • You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter.

  • FieldSelector (to only load certain stored fields) has been replaced with a simpler StoredFieldVisitor API.
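
As promised above, here is a minimal reusable Analyzer under the new createComponents API (4.0 alpha signatures assumed):

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
        TokenStream filtered = new LowerCaseFilter(Version.LUCENE_40, source);
        return new TokenStreamComponents(source, filtered);
      }
    };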

Happy searching!

7 comments:

  1. Thank you for this overview!
    Is there any kind of roadmap on when the 4.0.0 beta and RC version will be released?

  2. It's in a link, but just to mention the scoring improvements were also a Google Summer of Code project by David Nemeskey: http://www.google-melange.com/gsoc/project/google/gsoc2011/davidnemeskey/3001

  3. Christoph: there's no roadmap/schedule for the beta and GA release ... you'll have to keep an eye on the dev list to see how they are progressing.

    Robert: woops, you're right ... I updated the post.

  4. "You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter. "

    I was wondering what the reason behind this is. What if an application asks the user to select a directory to create an index in, and the user accidentally selects c:\ from a file/directory chooser or something similar.
    Lucene would then proceed to delete all files and folders on this drive? Or is this not the case?

  5. Hi Carl,

    You're right: this is dangerous. I've opened https://issues.apache.org/jira/browse/LUCENE-4190 to improve it...

    Thanks!

  6. Hi Mike, thanks for the link to Flax! I've just written a short blog post describing how to register and use new Codecs, based on what we did with the redis codec. You can find it here: http://www.romseysoftware.co.uk/2012/07/04/writing-a-new-lucene-codec/

  7. Hi Alan,

    Very cool! Thanks for writing that up and sharing. Keep it up :)

    Mike
