This is a major release with lots of great changes. Here I briefly describe the most important Lucene changes, but first the basics:
- All deprecated APIs as of 3.6.0 have been removed.
- Pre-3.0 indices are no longer supported.
- MIGRATE.txt describes how to update your application code.
- The index format won't change (unless a serious bug fix requires it) between this release and 4.0 GA, but APIs may still change before 4.0.0 beta.
Pluggable Codec
The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis.
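For instance, here is a minimal sketch of per-field customization, subclassing the default codec and routing one field to the Memory postings format described below. The "id" field name is my own example, and the exact APIs may still shift before GA:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

// Use the RAM-resident Memory postings format for the primary-key field,
// and fall back to the default format for everything else:
Codec codec = new Lucene40Codec() {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("id".equals(field)) {
      return PostingsFormat.forName("Memory");
    }
    return super.getPostingsFormatForField(field);
  }
};

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer); // analyzer: your Analyzer
iwc.setCodec(codec);
```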
There are some fun core codecs:
- Lucene40 is the default codec.
- Lucene3x (read-only) reads any index written with Lucene 3.x.
- SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!).
- MemoryPostingsFormat stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (primary key (id) field, date field, etc.).
- PulsingPostingsFormat inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup.
- AppendingCodec avoids seeking while writing, which is necessary for file systems such as Hadoop DFS.
A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API.
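As a rough sketch (assuming an open IndexReader called reader and a hypothetical "body" field), walking those four levels looks like this:

```java
import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

Fields fields = MultiFields.getFields(reader);   // merged view across segments
Terms terms = fields.terms("body");              // may be null if the field doesn't exist
TermsEnum termsEnum = terms.iterator(null);
BytesRef term;
while ((term = termsEnum.next()) != null) {
  // returns null if the field was not indexed with positions:
  DocsAndPositionsEnum postings = termsEnum.docsAndPositions(null, null);
  int doc;
  while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    int freq = postings.freq();
    for (int i = 0; i < freq; i++) {
      int position = postings.nextPosition();
      // consume (field, term, doc, position) here
    }
  }
}
```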
Flexible scoring
Lucene's scoring is now fully pluggable, with the TF/IDF vector space model remaining as the default. You can create your own scoring model, or use one of the core scoring models (BM25, Divergence from Randomness, Language Models, and Information-based models). Per-document normalization values are no longer limited to a single byte. Various new aggregate statistics are now available.
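For example, switching to BM25 is a one-line change on each side; a sketch (remember to use the same Similarity for indexing and searching):

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.util.Version;

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
iwc.setSimilarity(new BM25Similarity());
// ... index documents ...

IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarity(new BM25Similarity());
```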
These changes were part of a 2011 Google Summer of Code project (thank you David!).
These two changes are really important because they remove the barriers to ongoing innovations. Now it's easy to experiment with wild changes to the index format or to Lucene's scoring models. A recent example of such innovation is this neat codec by the devs at Flax to enable updatable fields by storing postings in a Redis key/value store.
Document Values
The new document values API stores strongly typed single-valued fields per document, meant as an eventual replacement for Lucene's field cache. The values are pre-computed during indexing and stored in the index in a column-stride format (values for a single field across all documents are stored together), making it much faster to initialize at search time than the field cache. Values can be fixed 8, 16, 32, 64 bit ints, or variable-bits sized (packed) ints; float or double; and six flavors of byte[] (fixed size or variable sized; dereferenced, straight or sorted).
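A small sketch of indexing doc values, using the per-type field classes as they stand in 4.0 (the field names here are made up):

```java
import org.apache.lucene.document.*;
import org.apache.lucene.util.BytesRef;

Document doc = new Document();
doc.add(new PackedLongDocValuesField("timestamp", 1341100800L));       // variable-bits packed long
doc.add(new FloatDocValuesField("popularity", 0.87f));                 // fixed 32-bit float
doc.add(new SortedBytesDocValuesField("country", new BytesRef("NZ"))); // sorted byte[]
```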
New Field APIs
The API for creating document fields has changed: Fieldable and AbstractField have been removed, and a new FieldType, factored out of the Field class, holds details about how the field's value should be indexed. New classes have been created for specific commonly-used fields:
- StringField indexes a string as a single token, without norms and as docs only. For example, use this for a primary key (id) field, or for a field you will sort on.
- TextField indexes the fully tokenized string, with norms and including docs, term frequencies and positions. For example, use this for the primary text field.
- StoredField is a field whose value is just stored.
- The XXXDocValuesField classes create typed document values fields.
- IntField, FloatField, LongField, DoubleField create typed numeric fields for efficient range queries and filters.
If none of these field classes apply, you can create your own FieldType (typically by starting from the exposed FieldTypes from the above classes and then tweaking), and then construct a Field by passing the name, FieldType and value, as in the sketch below.
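A quick sketch of the new API (field names and values are made up):

```java
import org.apache.lucene.document.*;

Document doc = new Document();
doc.add(new StringField("id", "5903", Field.Store.YES));                // single token, no norms
doc.add(new TextField("body", "Lucene 4.0.0 is here", Field.Store.NO)); // fully tokenized

// Or build your own FieldType, starting from an exposed one and tweaking:
FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setStoreTermVectors(true);
doc.add(new Field("title", "Lucene 4.0.0 alpha, at long last!", ft));
```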
Note that the old APIs (using Index, Store, TermVector enums) are still present (deprecated), to ease migration.
These changes were part of a 2011 Google Summer of Code project (thank you Nikola!).
Other big changes
Lucene's terms are now binary (arbitrary byte[]); by default they are UTF-8 encoded strings, sorted in Unicode sort order. But your Analyzer is free to produce tokens with an arbitrary byte[] (e.g., CollationKeyAnalyzer does so).
A new DirectSpellChecker finds suggestions directly from any Lucene index, avoiding the hassle of maintaining a sidecar spellchecker index. It uses the same fast Levenshtein automata as FuzzyQuery (see below).
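Usage is nearly a one-liner against your existing index; a sketch, where the misspelled term is my own example:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.SuggestWord;

DirectSpellChecker spellchecker = new DirectSpellChecker();
SuggestWord[] suggestions =
    spellchecker.suggestSimilar(new Term("body", "lucnee"), 5, reader);
for (SuggestWord suggestion : suggestions) {
  System.out.println(suggestion.string + " (docFreq=" + suggestion.freq + ")");
}
```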
Term offsets (the start and end character position of each term) may now be stored in the postings, by using FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS when indexing the field. I expect this will be useful for fast highlighting without requiring term vectors, but this part is not yet done (patches welcome!).
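To enable offsets for a field, you would set the index options on its FieldType, roughly like this sketch:

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo;

FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
ft.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
// then: doc.add(new Field("body", text, ft));
```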
A new AutomatonQuery matches all documents containing any term matching a provided automaton. Both WildcardQuery and RegexpQuery simply construct the corresponding automaton and then run AutomatonQuery. The classic QueryParser produces a RegexpQuery if you type fieldName:/expression/ or /expression against default field/.
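For example, these two produce equivalent automaton-backed queries (a sketch; the field name and pattern are made up):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.util.Version;

Query direct = new RegexpQuery(new Term("body", "luc[a-z]*"));
Query parsed = new QueryParser(Version.LUCENE_40, "body", analyzer)
    .parse("body:/luc[a-z]*/"); // throws ParseException on bad syntax
```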
Optimizations
Beyond the fun new features there are some incredible performance gains.
If you use FuzzyQuery, you should see a factor of 100-200 speedup on moderately sized indices.
If you search with a Filter, searches can be up to 3X faster (depending on filter density and query complexity), thanks to a change that applies filters just like we apply deleted documents.
If you use multiple threads for indexing, you should see stunning throughput gains (265% in one test), thanks to concurrent flushing. You are also now able to use an IndexWriter RAM buffer larger than 2048 MB (as long as you use multiple threads).
The new BlockTree default terms dictionary uses far less RAM to hold the terms index, and can sometimes avoid going to disk for terms that do not exist. In addition, the field cache also uses substantially less RAM, by avoiding separate objects per document and instead packing character data into shared byte[] blocks. Together this results in a 73% reduction in RAM required for searching in one case.
IndexWriter now buffers term data using byte[] instead of char[], using half the RAM for ASCII terms.
MultiTermQuery now rewrites per-segment, and caches per-term metadata to avoid a second lookup during scoring. This should improve performance, though it hasn't been directly tested.
If a BooleanQuery consists of only MUST TermQuery clauses, then a specialized ConjunctionTermScorer is used, giving a ~25% speedup.
Reducing merge IO impact
Merging (consolidating many small segments into a single big one) is a very IO- and CPU-intensive operation that can easily interfere with ongoing searches. In 4.0.0 we now have two ways to reduce this impact:
- Rate-limit the IO caused by ongoing merging, by calling FSDirectory.setMaxMergeWriteMBPerSec (see the sketch after this list).
- Use the new NativeUnixDirectory, which bypasses the OS's IO cache for all merge IO by using direct IO. This ensures that a merge won't evict hot pages used by searches. (Note that there is also a native WindowsDirectory, but it does not yet use direct IO during merging... patches welcome!)
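Rate limiting is a single call on the directory; the method name comes from the list above, but the exact signature here is my assumption:

```java
import java.io.File;
import org.apache.lucene.store.FSDirectory;

FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
dir.setMaxMergeWriteMBPerSec(8.0); // cap merge writes at ~8 MB/sec
```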
More generally, the APIs that open an input or output file (Directory.openInput and Directory.createOutput) now take an IOContext describing what's being done (e.g., flush vs merge), so you can create a custom Directory that changes its behavior depending on the context.
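For instance, here's a sketch of a context-aware Directory, subclassing NIOFSDirectory (one of the concrete FSDirectory implementations) and special-casing merge output:

```java
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.NIOFSDirectory;

public class MergeAwareDirectory extends NIOFSDirectory {
  public MergeAwareDirectory(File path) throws IOException {
    super(path);
  }

  @Override
  public IndexOutput createOutput(String name, IOContext context) throws IOException {
    if (context.context == IOContext.Context.MERGE) {
      // merge output: you could throttle, log, or route to direct IO here
    }
    return super.createOutput(name, context);
  }
}
```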
These changes were part of a 2011 Google Summer of Code project (thank you Varun!).
Consolidated modules
The diverse sources, previously scattered between Lucene's and Solr's core and contrib, have been consolidated. Especially noteworthy is the analysis module, providing a rich selection of 48 analyzers across many languages; the queries module, containing function queries (the old core function queries have been removed) and other non-core query classes; and the queryparser module, with numerous query parsers including the classic QueryParser (moved from core).
Other changes
Here's a long list of additional changes:
- The classic QueryParser now interprets term~N, where N is an integer >= 1, as a FuzzyQuery with edit distance N.
- The field cache normally requires single-valued fields, but we've added FieldCache.getDocTermsOrd, which can handle multi-valued fields.
- Analyzers must always provide a reusable token stream, by implementing the Analyzer.createComponents method (reusableTokenStream has been removed and tokenStream is now final, in Analyzer); see the sketch after this list.
- IndexReaders are now read-only (you cannot delete documents by id, nor change norms) and are strongly typed as AtomicIndexReader or CompositeIndexReader.
- The API for reading term vectors is the same API used for reading all postings, except the term vector API only covers a single document. This is a good match because term vectors are really just a single-document inverted index.
- Positional queries (PhraseQuery, SpanQuery) will now throw an exception if you run them against a field that did not index positions (previously they silently returned 0 hits).
- String-based field cache APIs have been replaced with BytesRef-based APIs.
- ParallelMultiSearcher has been absorbed into IndexSearcher as an optional ExecutorService argument to the constructor. Searcher and Searchable have been removed.
- All serialization code has been removed from Lucene's classes; you must handle serialization at a higher level in your application.
- Field names are no longer interned, so you cannot rely on == to test for equality (use .equals instead).
- *SpanFilter has been removed: these filters created too many objects during searching and were not scalable.
- IndexSearcher.close has been removed: IndexSearcher now only takes a provided IndexReader (no longer a Directory), which it is the caller's responsibility to close.
- You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter.
- FieldSelector (to only load certain stored fields) has been replaced with a simpler StoredFieldVisitor API.
Comments

Thank you for this overview! Is there any kind of roadmap on when the 4.0.0 beta and RC versions will be released?
It's in a link, but just to mention the scoring improvements were also a Google Summer of Code project by David Nemeskey: http://www.google-melange.com/gsoc/project/google/gsoc2011/davidnemeskey/3001
Christoph: there's no roadmap/schedule for the beta and GA release ... you'll have to keep an eye on the dev list to see how they are progressing.
Robert: woops, you're right ... I updated the post.
"You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter. "
ReplyDeleteI was wondering what the reason behind this is. What if an application asks the user to select a directory to create an index in, and the user accidentally selects c:\ from a file/directory chooser or something similar.
Lucene would then proceed to delete all files and folders on this drive? Or is this not the case?
Hi Carl,

You're right: this is dangerous. I've opened https://issues.apache.org/jira/browse/LUCENE-4190 to improve it...

Thanks!
Hi Mike, thanks for the link to Flax! I've just written a short blog post describing how to register and use new Codecs, based on what we did with the redis codec. You can find it here: http://www.romseysoftware.co.uk/2012/07/04/writing-a-new-lucene-codec/
ReplyDeleteHi Alan,
ReplyDeleteVery cool! Thanks for writing that up and sharing. Keep it up :)
Mike