A codec is actually a collection of formats, one for each part of the index. For example,
StoredFieldsFormat handles stored
fields, NormsFormat handles norms, etc. There are eight
formats in total, and a codec could simply be a new mix of pre-existing
formats, or perhaps you create your own
TermVectorsFormat and otherwise use all the formats from
the Lucene40 codec, for example.
The trickiest format to create is
PostingsFormat,
which provides read/write access to all postings (fields, terms, documents,
frequencies, positions, offsets, payloads). Part of the challenge is
that it has a large API surface area. But there are also complexities
such as skipping, reuse, conditional use of different values in the
enumeration (frequencies, positions, payloads, offsets), partial
consumption of the enumeration, etc. These challenges unfortunately make
it easy for bugs to sneak in, but an awesome way to ferret out all the
bugs is to leverage Lucene's
extensive randomized
tests: run all tests with
-Dtests.postingsformat=XXX (be sure to first
register your new postings
format). If your new postings format has a bug, tests will most
likely fail.
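To make the terminology concrete, here is a toy in-memory model of what postings contain: for each term, the documents it appears in, plus the frequency and positions within each document. This is only a sketch; the class and method names (ToyPostings, Posting, docs, postings) are hypothetical and are not Lucene's API, and it ignores payloads, offsets, skipping and reuse entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: these classes are hypothetical and are NOT Lucene's API.
final class ToyPostings {
    /** One entry in a term's postings list: a document and where the term occurs in it. */
    static final class Posting {
        final int docID;
        final List<Integer> positions = new ArrayList<>();
        Posting(int docID) { this.docID = docID; }
        int freq() { return positions.size(); }
    }

    // term -> postings in the order documents were added (assumed increasing docID)
    private final Map<String, List<Posting>> index = new TreeMap<>();

    void addDocument(int docID, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) continue;
            List<Posting> list = index.computeIfAbsent(tokens[pos], t -> new ArrayList<>());
            // Start a new posting the first time this term is seen in this document.
            if (list.isEmpty() || list.get(list.size() - 1).docID != docID) {
                list.add(new Posting(docID));
            }
            list.get(list.size() - 1).positions.add(pos);
        }
    }

    /** Documents-only view, for a consumer that never asks for freqs or positions. */
    List<Integer> docs(String term) {
        List<Integer> out = new ArrayList<>();
        for (Posting p : postings(term)) out.add(p.docID);
        return out;
    }

    /** Full view, for consumers that also need frequencies and positions. */
    List<Posting> postings(String term) {
        List<Posting> list = index.get(term);
        return list == null ? Collections.<Posting>emptyList() : list;
    }

    public static void main(String[] args) {
        ToyPostings idx = new ToyPostings();
        idx.addDocument(0, "lucene codecs and lucene postings");
        idx.addDocument(1, "postings formats");
        System.out.println(idx.docs("postings"));
        for (Posting p : idx.postings("lucene")) {
            System.out.println("doc=" + p.docID + " freq=" + p.freq() + " positions=" + p.positions);
        }
    }
}
```

Even in this toy, a consumer may want only documents, or documents plus frequencies, or everything. A real postings format has to serve each of those combinations correctly, which is part of why the API surface is so large.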
However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others. My favorite is the innocently named TestBasics! Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net, our tests are not a great fit for the early iterations when developing a new postings format.
I recently created a new postings format,
BlockPostingsFormat, which
will hopefully be more efficient than the Sep codec at
using fixed int block encodings. I did this to support Han Jiang's
Google
Summer of Code project to add a useful int block postings format
to Lucene.
So, I took the opportunity to address this problem of easier early-stage iterations while developing a new postings format by creating a new test,
TestPostingsFormat.
It has layers of testing (documents, +freqs, +positions, +payloads,
+offsets) that you can incrementally enable as you iterate, as well as
different test options (skipping or not, reuse or not, stop visiting
documents and/or positions early, one or more threads, etc.). When
you turn on verbose (-Dtests.verbose=trueThe goal of this test is to be so thorough that if it passes with your posting format then all Lucene's tests should pass. If ever we find that's not the case then I consider that a bug in
TestPostingsFormat! (Who tests the tester?)
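The layering idea can be sketched in a few lines. This is not the actual TestPostingsFormat code; the names here (LayeredCheck, Layer, Entry, verify) are made up for illustration, and only the first three layers are modeled to keep it short. The point is that a check for a given layer includes all earlier layers and ignores everything after it, so a half-finished format can still pass the layers it already supports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from Lucene or TestPostingsFormat.
final class LayeredCheck {
    /** Layers in the order you would enable them; each one implies all earlier ones. */
    enum Layer { DOCS, FREQS, POSITIONS }

    /** Expected or actual data for one document in a term's postings. */
    static final class Entry {
        final int docID;
        final int freq;
        final int[] positions;
        Entry(int docID, int freq, int... positions) {
            this.docID = docID;
            this.freq = freq;
            this.positions = positions;
        }
    }

    /** Compares actual against expected, but only up to maxLayer; returns failure messages. */
    static List<String> verify(List<Entry> expected, List<Entry> actual, Layer maxLayer) {
        List<String> failures = new ArrayList<>();
        if (expected.size() != actual.size()) {
            failures.add("doc count: expected " + expected.size() + " but got " + actual.size());
            return failures;
        }
        for (int i = 0; i < expected.size(); i++) {
            Entry e = expected.get(i);
            Entry a = actual.get(i);
            if (e.docID != a.docID) {
                failures.add("docID mismatch at index " + i);
            }
            if (maxLayer.compareTo(Layer.FREQS) >= 0 && e.freq != a.freq) {
                failures.add("freq mismatch in doc " + e.docID);
            }
            if (maxLayer.compareTo(Layer.POSITIONS) >= 0 && !Arrays.equals(e.positions, a.positions)) {
                failures.add("positions mismatch in doc " + e.docID);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Entry> expected = Arrays.asList(new Entry(3, 2, 1, 7));
        // An early-stage format that only gets documents right so far.
        List<Entry> actual = Arrays.asList(new Entry(3, 0));
        for (Layer layer : Layer.values()) {
            System.out.println(layer + ": " + verify(expected, actual, layer));
        }
    }
}
```

With a docs-only implementation, the DOCS layer passes cleanly while FREQS and POSITIONS report exactly what is still missing, which is the kind of feedback you want during early iterations.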
If you find yourself creating a new postings format I strongly suggest using the new
TestPostingsFormat during early
development to get your postings format off the ground. Once it's
passing, run all tests with your new postings format, and if something
fails please let us
know so we can fix TestPostingsFormat.
What is a postings format?
Is it something related to the way a string is sent to the searcher?
The PostingsFormat controls how the inverted index (terms mapping to list of documents that contain that term, plus things like frequency, positions, offsets) is stored in the index.
Thanks Mike Sir for answering.
Mike
Lucene newbie here. Sorry if my question sounds silly.
Still trying to understand the terminology related to Postings/PostingsFormat/Codec. Are they all the same?
From what you said, if a custom PostingsFormat controls how the data is stored in the index, would it be right to say that, for the same dataset being loaded into the index, different postings formats have different performance?
What would be the best PostingsFormat to use if speed is important?
Also, can you point me to documentation or an example for building a custom PostingsFormat in Java?
Thanks
John
Hi Anonymous,
A Codec holds many formats: postings, term vectors, deletions, etc. PostingsFormat just covers how postings (= the inverted part of the index, i.e. fields, terms, docs, positions, freqs, offsets) are written.
There are very different performance characteristics for each PostingsFormat ... I would say the best one is the current default one (BlockPostingsFormat): it's fast for frequent terms because it bulk encodes/decodes the integers.
I don't think there's any good document that describes the steps in building a PostingsFormat. I would start by looking at the PostingsFormat impls in the Lucene sources? Typically one builds a PostingsBaseFormat, which plugs into the terms dictionary, because building a new terms dictionary is challenging ... the PostingsBaseFormat only needs to encode the docs/freqs/positions/offsets.
Thanks Mike. A lot clearer to me now :)
I see that Lucene 4 used the Lucene40 postings format by default. Would it be possible to use BlockPostingsFormat and still be on Lucene 4, or do we need to upgrade to Lucene 4.1?
-John
Hi John,
That's a good question ... I'm not sure? If any of the codec APIs changed (which is likely: they are marked experimental) then you'd need to upgrade Lucene core as well. Try it and report back ;)