This is a major release with lots of great changes. Here I briefly describe the most important Lucene changes, but first the basics:
- All deprecated APIs as of 3.6.0 have been removed.
- Pre-3.0 indices are no longer supported.
- MIGRATE.txt describes how to update your application.
- The index format won't change (unless a serious bug fix
requires it) between this release and 4.0 GA, but APIs may
still change before 4.0.0 beta.
The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis.
There are some fun core codecs:
- Lucene40 is the default codec.
- Lucene3x (read-only) reads any index written with Lucene 3.x.
- SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!).
- Memory stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (primary key (id) field, date field, etc.).
- Pulsing inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup.
- AppendingCodec avoids seeking while writing, necessary for file-systems such as Hadoop DFS.
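Because the postings format is pluggable per field, you can mix codecs in one index. Here is a minimal sketch (assuming Lucene 4.0 jars on the classpath) that routes a hypothetical "id" field through the bundled Memory postings format while every other field keeps the default:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class PerFieldCodecSketch {
  public static IndexWriterConfig config() {
    IndexWriterConfig cfg = new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    cfg.setCodec(new Lucene40Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        if ("id".equals(field)) {
          // keep primary-key postings entirely in RAM
          return PostingsFormat.forName("Memory");
        }
        return super.getPostingsFormatForField(field);
      }
    });
    return cfg;
  }
}
```

Pass the resulting config to `new IndexWriter(directory, config)` as usual; the codec is recorded per segment so older segments remain readable.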
A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API.
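The four levels can be walked with nested enumerators. This sketch (assuming Lucene 4.0 jars on the classpath; the field and text are mine) indexes one tiny document and then enumerates every term and the documents containing it:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PostingsDemo {
  // Returns the number of (term, document) postings in the "body" field.
  public static int countPostings() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    Document doc = new Document();
    doc.add(new TextField("body", "hello postings world", Field.Store.NO));
    writer.addDocument(doc);
    writer.close();

    DirectoryReader reader = DirectoryReader.open(dir);
    int count = 0;
    Terms terms = MultiFields.getTerms(reader, "body"); // one field's terms
    TermsEnum termsEnum = terms.iterator(null);
    while (termsEnum.next() != null) {                  // each term...
      DocsEnum docs = termsEnum.docs(MultiFields.getLiveDocs(reader), null);
      while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        count++;                                        // ...each matching doc
      }
    }
    reader.close();
    return count;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(countPostings());
  }
}
```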
Lucene's scoring is now fully pluggable, with the TF/IDF vector space model remaining as the default. You can create your own scoring model, or use one of the core scoring models (BM25, Divergence from Randomness, Language Models, and Information-based models). Per-document normalization values are no longer limited to a single byte. Various new aggregate statistics are now available.
These changes were part of a 2011 Google Summer of Code project (thank you David!).
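Swapping the scoring model in is a one-liner on each side of the index. A minimal sketch, assuming Lucene 4.0 jars on the classpath (the class and method names wrapping it are mine):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.util.Version;

public class Bm25Setup {
  public static IndexWriterConfig writerConfig() {
    IndexWriterConfig cfg = new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    cfg.setSimilarity(new BM25Similarity()); // used while writing norms
    return cfg;
  }

  public static IndexSearcher searcher(DirectoryReader reader) {
    IndexSearcher s = new IndexSearcher(reader);
    s.setSimilarity(new BM25Similarity()); // used while scoring
    return s;
  }
}
```

Use the same Similarity at index and search time, since norms written by one model are interpreted by the other.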
These two changes are really important because they remove the barriers to ongoing innovations. Now it's easy to experiment with wild changes to the index format or to Lucene's scoring models. A recent example of such innovation is this neat codec by the devs at Flax to enable updatable fields by storing postings in a Redis key/value store.
The new document values API stores strongly typed single-valued fields per document, meant as an eventual replacement for Lucene's field cache. The values are pre-computed during indexing and stored in the index in a column-stride format (values for a single field across all documents are stored together), making it much faster to initialize at search time than the field cache. Values can be fixed 8, 16, 32, 64 bit ints, or variable-bits sized (packed) ints; float or double; and six flavors of byte (fixed size or variable sized; dereferenced, straight or sorted).
New Field APIs
The API for creating document fields has changed:
Fieldable and AbstractField have been removed, and a new FieldType, factored out of the Field class, holds details about how the field's value should be indexed. New classes have been created for specific commonly-used fields:
- StringField indexes a string as a single token, without norms and as docs only. For example, use this for a primary key (id) field, or for a field you will sort on.
- TextField indexes the fully tokenized string, with norms and including docs, term frequencies and positions. For example, use this for the primary text field.
- StoredField is a field whose value is just stored.
- XXXDocValuesField creates typed document values fields.
- IntField, LongField, FloatField and DoubleField create typed numeric fields for efficient range queries and filters.
You can also create your own FieldType (typically by starting from the FieldTypes exposed by the above classes and then tweaking), and then construct a Field by passing the name, value and FieldType.
Note that the old APIs (using the Index, Store and TermVector enums) are still present (deprecated), to ease migration.
These changes were part of a 2011 Google Summer of Code project (thank you Nikola!).
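Putting the new field classes together, here is a minimal sketch (assuming Lucene 4.0 jars on the classpath; field names and values are mine) that builds one document and also derives a custom FieldType by tweaking an existing one:

```java
import org.apache.lucene.document.*;

public class FieldDemo {
  public static Document buildDoc() {
    Document doc = new Document();
    doc.add(new StringField("id", "doc-42", Field.Store.YES));        // single token, no norms
    doc.add(new TextField("body", "full text goes here", Field.Store.NO)); // tokenized, scored
    doc.add(new StoredField("payload", "stored but not indexed"));

    // Custom type: start from an exposed FieldType and tweak it.
    FieldType withVectors = new FieldType(TextField.TYPE_NOT_STORED);
    withVectors.setStoreTermVectors(true);
    withVectors.freeze(); // make it immutable before use
    doc.add(new Field("title", "a title", withVectors));
    return doc;
  }

  public static void main(String[] args) {
    System.out.println(buildDoc().getFields().size());
  }
}
```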
Other big changes
Lucene's terms are now binary (arbitrary byte[]); by default they are UTF-8 encoded strings, sorted in Unicode sort order. But your Analyzer is free to produce tokens with arbitrary byte[] content.
DirectSpellChecker finds suggestions directly from any Lucene index, avoiding the hassle of maintaining a sidecar spellchecker index. It uses the same fast Levenshtein automata as FuzzyQuery.
Term offsets (the start and end character position of each term) may now be stored in the postings, by using FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS when indexing the field. I expect this will be useful for fast highlighting without requiring term vectors, but this part is not yet done (patches welcome!).
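A minimal sketch of turning offsets on, assuming Lucene 4.0 jars on the classpath (the wrapper class is mine):

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo.IndexOptions;

public class OffsetsFieldType {
  // Returns a FieldType whose postings include offsets alongside positions.
  public static FieldType withOffsets() {
    FieldType t = new FieldType(TextField.TYPE_NOT_STORED);
    t.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    t.freeze();
    return t;
  }
}
```

Any `new Field("body", text, OffsetsFieldType.withOffsets())` will then write offsets into the postings for that field.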
AutomatonQuery matches all documents containing any term matching a provided automaton. Both WildcardQuery and RegexpQuery simply construct the corresponding automaton and then run AutomatonQuery. The classic QueryParser produces a RegexpQuery if you type fieldName:/expression/ or /expression against default field/.
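An end-to-end sketch of RegexpQuery, assuming Lucene 4.0 jars on the classpath (field name and text are mine):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RegexpDemo {
  public static int hits() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    Document doc = new Document();
    doc.add(new TextField("body", "lucene in action", Field.Store.NO));
    w.addDocument(doc);
    w.close();

    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    // The regexp is compiled to an automaton and intersected with the terms.
    int total = searcher.search(new RegexpQuery(new Term("body", "luc.*")), 10).totalHits;
    reader.close();
    return total;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(hits());
  }
}
```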
Beyond the fun new features there are some incredible performance gains.
If you use
FuzzyQuery, you should see a factor of 100-200 speedup on moderately sized indices.
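The sped-up FuzzyQuery is driven by an explicit edit distance. This sketch (assuming Lucene 4.0 jars on the classpath; field name and terms are mine) matches a misspelling within one edit, equivalent to typing lucens~1 in the classic QueryParser:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class FuzzyDemo {
  public static int hits() throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40)));
    Document doc = new Document();
    doc.add(new TextField("body", "lucene", Field.Store.NO));
    w.addDocument(doc);
    w.close();

    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    // "lucens" is one substitution away from "lucene" -> maxEdits = 1 matches.
    int total = searcher.search(
        new FuzzyQuery(new Term("body", "lucens"), 1), 10).totalHits;
    reader.close();
    return total;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(hits());
  }
}
```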
If you search with a
Filter, you can see gains up to 3X faster (depending on filter density and query complexity), thanks to a change that applies filters just like we apply deleted documents.
If you use multiple threads for indexing, you should see stunning throughput gains (265% in one benchmark), thanks to concurrent flushing. You can also now use a more than 2048 MB IndexWriter RAM buffer (as long as you use multiple threads).
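A minimal configuration sketch, assuming Lucene 4.0 jars on the classpath (the wrapper class is mine):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class BigRamBuffer {
  public static IndexWriterConfig config() {
    IndexWriterConfig cfg = new IndexWriterConfig(
        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
    // Buffers above the old 2048 MB ceiling are now usable,
    // provided multiple threads feed the writer.
    cfg.setRAMBufferSizeMB(4096);
    return cfg;
  }
}
```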
The new BlockTree default terms dictionary uses far less RAM to hold the terms index, and can sometimes avoid going to disk for terms that do not exist. In addition, the field cache also uses substantially less RAM, by avoiding separate objects per document and instead packing character data into shared byte[] blocks. Together this results in a 73% reduction in RAM required for searching in one case.
IndexWriter now buffers term data as byte[] rather than char[], using half the RAM for ASCII terms.
MultiTermQuerynow rewrites per-segment, and caches per-term metadata to avoid a second lookup during scoring. This should improve performance though it hasn't been directly tested.
If a BooleanQuery consists of only MUST TermQuery clauses, then a specialized ConjunctionTermScorer is used, giving ~25% speedup.
Reducing merge IO impact
Merging (consolidating many small segments into a single big one) is a very IO and CPU intensive operation which can easily interfere with ongoing searches. In 4.0.0 we now have two ways to reduce this impact:
- Rate-limit the IO caused by ongoing merging, by calling FSDirectory.setMaxMergeWriteMBPerSec.
- Use the new NativeUnixDirectory, which bypasses the OS's IO cache for all merge IO by using direct IO. This ensures that a merge won't evict hot pages used by searches. (Note that there is also a native WindowsDirectory, but it does not yet use direct IO during merging... patches welcome!)
More generally, the APIs that open an input or output file (Directory.openInput and Directory.createOutput) now take an IOContext describing what's being done (e.g., flush vs merge), so you can create a custom Directory that changes its behavior depending on the context.
These changes were part of a 2011 Google Summer of Code project (thank you Varun!).
The diverse sources, previously scattered between Lucene's and Solr's core and contrib, have been consolidated into modules. Especially noteworthy is the analysis module, providing a rich selection of 48 analyzers across many languages; the queries module, containing function queries (the old core function queries have been removed) and other non-core query classes; and the queryparser module, with numerous query parsers including the classic QueryParser (moved from core).
Here's a long list of additional changes:
- The classic QueryParser now interprets term~N, where N is an integer >= 1, as a FuzzyQuery with edit distance N.
- The field cache normally requires single-valued fields, but there is a new method, FieldCache.getDocTermsOrd, which can handle multi-valued fields.
- Analyzers must always provide a reusable token stream, by implementing Analyzer.createComponents (reusableTokenStream has been removed and tokenStream is now final, in Analyzer).
- IndexReaders are now read-only (you cannot delete documents by id, nor change norms) and are strongly typed as AtomicReader or CompositeReader.
- The API for reading term vectors is the same API used for
reading all postings, except the term vector API only covers a
single document. This is a good match because term vectors are
really just a single-document inverted index.
- Positional queries (PhraseQuery and SpanQuery) will now throw an exception if you run them against a field that did not index positions (previously they silently returned 0 hits).
- String-based field cache APIs have been replaced with BytesRef-based equivalents.
- ParallelMultiSearcher has been absorbed into IndexSearcher as an optional ExecutorService argument to the constructor. Searcher and Searchable have been removed.
- All serialization code has been removed from Lucene's classes; you must handle serialization at a higher level in your application.
- Field names are no longer interned, so you cannot rely on == to test for equality (use equals instead).
- *SpanFilter has been removed: these filters created too many objects during searching and were not scalable.
- IndexSearcher now takes only a provided IndexReader (no longer a Directory); it is the caller's responsibility to close the reader.
- You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter.
- FieldSelector (to only load certain stored fields) has been replaced with a simpler StoredFieldVisitor API.
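One item in the list above, the new ExecutorService constructor argument, can be sketched briefly (assuming Lucene 4.0 jars on the classpath; the wrapper class is mine):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;

public class ThreadedSearch {
  public static IndexSearcher threadedSearcher(DirectoryReader reader) {
    // Segments are searched concurrently across this pool.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    // Note: IndexSearcher will not shut the pool down for you;
    // call pool.shutdown() yourself when finished searching.
    return new IndexSearcher(reader, pool);
  }
}
```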
Thank you for this overview!
Is there any kind of roadmap on when the 4.0.0 beta and RC version will be released?
It's in a link, but just to mention the scoring improvements were also a Google Summer of Code project by David Nemeskey: http://www.google-melange.com/gsoc/project/google/gsoc2011/davidnemeskey/3001
Christoph: there's no roadmap/schedule for the beta and GA release ... you'll have to keep an eye on the dev list to see how they are progressing.
Robert: woops, you're right ... I updated the post.
"You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter."
I was wondering what the reason behind this is. What if an application asks the user to select a directory to create an index in, and the user accidentally selects c:\ from a file/directory chooser or something similar.
Lucene would then proceed to delete all files and folders on this drive? Or is this not the case?
You're right: this is dangerous. I've opened https://issues.apache.org/jira/browse/LUCENE-4190 to improve it...
Hi Mike, thanks for the link to Flax! I've just written a short blog post describing how to register and use new Codecs, based on what we did with the redis codec. You can find it here: http://www.romseysoftware.co.uk/2012/07/04/writing-a-new-lucene-codec/
Very cool! Thanks for writing that up and sharing. Keep it up :)