A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.
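For example, here is a minimal sketch of that last idea, using the Lucene 4.x-era FilterCodec helper; MyTermVectorsFormat is a hypothetical stand-in for your own implementation:

    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.TermVectorsFormat;
    import org.apache.lucene.codecs.lucene40.Lucene40Codec;

    // Reuses every format from Lucene40 except term vectors.
    public final class MyCodec extends FilterCodec {
      public MyCodec() {
        super("MyCodec", new Lucene40Codec());
      }

      @Override
      public TermVectorsFormat termVectorsFormat() {
        return new MyTermVectorsFormat(); // hypothetical custom format
      }
    }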
The trickiest format to create is PostingsFormat, which provides read/write access to all postings (fields, terms, documents, frequencies, positions, offsets, payloads). Part of the challenge is that it has a large API surface area. But there are also complexities such as skipping, reuse, conditional use of different values in the enumeration (frequencies, positions, payloads, offsets), partial consumption of the enumeration, etc. These challenges unfortunately make it easy for bugs to sneak in, but an awesome way to ferret out all the bugs is to leverage Lucene's extensive randomized tests: run all tests with -Dtests.postingsformat=XXX (be sure to first register your new postings format). If your new postings format has a bug, tests will most likely fail.
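Registration happens through Java's SPI: list your class in a META-INF/services/org.apache.lucene.codecs.PostingsFormat file on the classpath, and give it a no-arg constructor that passes the name you will reference from the command line. A bare-bones skeleton under the Lucene 4.x-era APIs (the name MyPostings and the TODO bodies are placeholders):

    import java.io.IOException;
    import org.apache.lucene.codecs.FieldsConsumer;
    import org.apache.lucene.codecs.FieldsProducer;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.index.SegmentReadState;
    import org.apache.lucene.index.SegmentWriteState;

    public final class MyPostingsFormat extends PostingsFormat {
      public MyPostingsFormat() {
        super("MyPostings"); // the name -Dtests.postingsformat=MyPostings refers to
      }

      @Override
      public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
        throw new UnsupportedOperationException("TODO: write side");
      }

      @Override
      public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
        throw new UnsupportedOperationException("TODO: read side");
      }
    }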
However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others. My favorite is the innocently named TestBasics! Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net our tests are not a great fit for the early iterations when developing a new postings format.
I recently created a new postings format, BlockPostingsFormat, which will hopefully be more efficient than the Sep codec at using fixed int block encodings. I did this to support Han Jiang's Google Summer of Code project to add a useful int block postings format to Lucene.
So, I took the opportunity to address this problem of early-stage iteration while developing a new postings format by creating a new test, TestPostingsFormat. It has layers of testing (documents, +freqs, +positions, +payloads, +offsets) that you can incrementally enable as you iterate, as well as different test options (skipping or not, reuse or not, stop visiting documents and/or positions early, one or more threads, etc.). When you turn on verbose (-Dtests.verbose=true), the test prints details about what it is doing.
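For example, something like this (the -Dtestcase flag is the usual Lucene ant convention for running a single test class, and MyPostings is the hypothetical name registered above):

    ant test -Dtestcase=TestPostingsFormat -Dtests.postingsformat=MyPostings -Dtests.verbose=true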
The goal of this test is to be so thorough that if it passes with your postings format then all Lucene's tests should pass. If we ever find that's not the case, then I consider that a bug in TestPostingsFormat! (Who tests the tester?)
If you find yourself creating a new postings format, I strongly suggest using the new TestPostingsFormat during early development to get your postings format off the ground. Once it's passing, run all tests with your new postings format, and if something fails please let us know so we can fix TestPostingsFormat.
What is a postings format?
Is it something related to the way the string is sent to the searcher?
The PostingsFormat controls how the inverted index (terms mapping to the list of documents that contain each term, plus things like frequency, positions, offsets) is stored in the index.
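As a toy illustration (made-up data), the postings a PostingsFormat must encode and decode look like:

    term "cat" -> doc 2 (freq=1, positions=[4]), doc 7 (freq=2, positions=[1, 9])
    term "dog" -> doc 7 (freq=1, positions=[3])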
Thanks Mike Sir for answering.
Mike,
Lucene newbie here. Sorry if my question sounds silly.
Still trying to understand the terminology around Postings/PostingsFormat/Codec. Are they all the same?
From what you said, if a custom postings format controls how the data is stored in the index, would it be right to say that for the same dataset being loaded into the index, different postings formats have different performance times?
What would be the best postings format to use if speed is important?
Also, can you point me to documentation/examples for building a custom postings format in Java?
Thanks
John
Hi Anonymous,
A Codec holds many formats: postings, term vectors, deletions, etc. PostingsFormat just covers how postings (= the inverted part of the index, i.e. fields, terms, docs, positions, freqs, offsets) are written.
There are very different performance characteristics for each PostingsFormat ... I would say the best one is the current default one (BlockPostingsFormat): it's fast for frequent terms because it bulk encodes/decodes the integers.
I don't think there's any good document that describes the steps in building a PostingsFormat. I would start by looking at the PostingsFormat impls in the Lucene sources? Typically one builds a PostingsBaseFormat, which plugs into the terms dictionary, because building a new terms dictionary is challenging ... the PostingsBaseFormat only needs to encode the docs/freqs/positions/offsets.
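To make that concrete, here is a hedged sketch of the write side, modeled on how the 4.x sources plug a postings writer into the block-tree terms dictionary; MyPostingsWriter is a hypothetical PostingsWriterBase, and you should check the exact signatures in your version (the read side mirrors this with a PostingsReaderBase and BlockTreeTermsReader):

    import java.io.IOException;
    import org.apache.lucene.codecs.BlockTreeTermsWriter;
    import org.apache.lucene.codecs.FieldsConsumer;
    import org.apache.lucene.codecs.PostingsWriterBase;
    import org.apache.lucene.index.SegmentWriteState;
    import org.apache.lucene.util.IOUtils;

    // Inside your PostingsFormat:
    @Override
    public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
      // You encode only docs/freqs/positions/offsets ...
      PostingsWriterBase postingsWriter = new MyPostingsWriter(state); // hypothetical
      boolean success = false;
      try {
        // ... while the existing block-tree terms dictionary handles the terms.
        FieldsConsumer fields = new BlockTreeTermsWriter(
            state, postingsWriter,
            BlockTreeTermsWriter.DEFAULT_MIN_BLOCK_SIZE,
            BlockTreeTermsWriter.DEFAULT_MAX_BLOCK_SIZE);
        success = true;
        return fields;
      } finally {
        if (!success) {
          IOUtils.closeWhileHandlingException(postingsWriter);
        }
      }
    }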
Thanks Mike. A lot clearer to me now :)
I see that Lucene 4 used the Lucene40 postings format by default. Would it be possible to use BlockPostingsFormat and still be on Lucene 4, or do we need to upgrade to Lucene 4.1?
-John
Hi John,
That's a good question ... I'm not sure? If any of the codec APIs changed (which is likely: they are marked experimental) then you'd need to upgrade lucene core as well. Try it and report back ;)
Some advice to all of those asking about what postings are: have a look at the online IR Book at http://www-nlp.stanford.edu/IR-book/ . The first few chapters will help you understand the theory behind postings and inverted index. After reading the IR Book, this blog post makes much more sense! :-) Thanks Mike.
Thanks Gora, the online IR Book is a wonderful resource!
Is there documentation about how Lucene works internally when one does a query, say "title:Foo"? E.g., how the tii, tis, prx, fdt, etc. files are looked up, and in what order, to get the information.
I read http://lucene.apache.org/core/3_0_3/fileformats.html but I'm not sure how things work when a query arrives.
I don't think there's any documentation for this besides the source code itself; could you ask on the dev list (dev@lucene.apache.org)?
Does Lucene 4.7 use Lucene41PostingsFormat for Postings.nextdoc() while executing span queries? My requirement is to run multiple span queries like [cat dog]~2 on 2 TB of index, and I am worried about performance since I have to collect all the docs in the results. Is there a better postings format to choose, like memory postings, when using span queries in Solr 4.7? Does having term vectors help me in this regard?
Yes, Lucene41PF is still the default, so it's used for all queries to iterate docs/positions. MemoryPF isn't really a good match: it's best for primary key fields (each term matches just one document). Term vectors are unrelated: they are not used during searching, and they are quite slow to retrieve per document.
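For completeness, if you did have a true primary-key field, the per-field hook is how you would give just that field the Memory format while keeping the default elsewhere. A sketch against the Lucene 4.6-era API (the field name "id" and the version/analyzer variables are assumptions):

    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene46.Lucene46Codec;
    import org.apache.lucene.index.IndexWriterConfig;

    IndexWriterConfig iwc = new IndexWriterConfig(version, analyzer);
    iwc.setCodec(new Lucene46Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        if ("id".equals(field)) {
          return PostingsFormat.forName("Memory"); // MemoryPostingsFormat
        }
        return super.getPostingsFormatForField(field); // Lucene41 default
      }
    });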
Thanks, so is there a better postings format to choose when using span queries in Solr 4.6 or Solr 4.7? Since span queries utilize all the parts of postings: terms, documents, frequencies, positions and offsets.
When I am debugging the Lucene 4.6 test cases for span queries, it shows that for the above nextDoc() call it is utilizing DirectPostingsFormat.
Does having termPositions=true and termOffsets=true help?
Span queries are unfortunately very costly. Have you tried a sloppy PhraseQuery instead? They will still be costly but perhaps less-so.
However, even better: do a simpler (non-positional) query and then use the new rescoring APIs to re-sort the top e.g. 500 hits with your positional queries.
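A sketch of that two-pass idea (QueryRescorer is the rescoring API, added around Lucene 4.8; searcher, cheapQuery and spanQuery are assumed to exist already):

    import org.apache.lucene.search.QueryRescorer;
    import org.apache.lucene.search.TopDocs;

    // First pass: cheap, non-positional query over the whole index.
    TopDocs hits = searcher.search(cheapQuery, 500);

    // Second pass: re-sort only those top 500 hits with the costly span
    // query, blending 2x the span score into the first-pass score.
    TopDocs rescored = QueryRescorer.rescore(searcher, hits, spanQuery, 2.0, 500);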
DirectPostingsFormat is a very RAM-heavy, but fast, postings format.
You don't need termOffsets for span queries; that's typically used for highlighting with PostingsHighlighter.
I will convert and run all span queries as conjunction queries and then use the output as a filter query for the original span query. That might reduce the scope of the span query substantially.
Thanks for the idea.
Hi Mike,
We were trying to use a configurable PerFieldPostingsFormat to get an updateable field with a key-value store, something similar to what you have linked in your post (Flax). However, I doubt that an updateable field is possible using codecs, for the following reasons:
Assume the key for this key-value store is segmentName_fieldName_term and the value is the postings array: say, just an array of docIds. Assume there is a field PK (an id kind of field) which is stored and uses the default codec.
#1. The fieldConsumers of a PostingsFormat are called only at flush time. With this design, handling an update for a document not yet flushed seems difficult for a field writing to a key-value store. So if I add a document with, say, PK:1 and then add another partial-update document for PK:1, then to update this we first need to figure out the segmentName for this document; only then can we update it in the key-value store. But this doc has no segmentName as of now. Because of DWPTs we can't even assume that the document will be in this same segment. Note: since the custom PostingsFormat is only called at flush time, we write the postings data directly to the key-value store.
#2. Merged segments are just checkpointed; they are not committed. So for the custom TermsConsumer.merge() call we cannot write the new merged state to the key-value store directly. Note that merge() and flush() both use the same methods, like startTerm(), startDoc(), finishTerm(), etc., and since in these methods the design is to write directly to the key-value store, we do not use the TermsConsumer.merge() function. We actually put the new merged state in memory but have no good signal to flush it to the key-value store, so we flush this in-memory merge state at the next flush. However, Lucene can commit the checkpointed merged segment without any document to be flushed. So we get into a state where the Lucene index directory has new merged segments but the key-value store is not yet updated. If we write the merged state directly, then we get into a state where the key-value store has the new merged segment but the Lucene Directory does not.
#3. Dropped segments. Again, we can't handle this, as the only time our custom consumers can remove deleted docs is at merge time. But Lucene can drop segments during normal commits.
#4. Replication. We take both index files and key-value store files for replication. We cannot keep the two in sync.
#5. Applying updates to a reindexed document when the reindexed document is taken by a DWPT which is flushed after the partial-update document. Updates are lost in this case.
Currently, I plan to try adding SegmentInfosFormat (maybe LiveDocsFormat too) and a DirectoryWrapper, but as of now I have no idea whether that will solve all the problems.
It would be great to know your thoughts on the feasibility of providing updateable fields using codecs.
Hi Aditya,
Sorry for the slow response here; if you haven't already, could you ask this question on the Lucene users list instead? (java-user@lucene.apache.org).
Hi Mike,
Thanks a lot for your reply. I could solve some of the easier issues listed above. I will post a new question based on the remaining issues.
Hello Mike,
I'm starting to study Lucene, and I'm curious to see what the inverted index looks like after indexing some files.
I would like to know what steps I should take and which classes to use.
Thanks if you can respond.
Hi Igor, I suggest starting with Lucene's "demo" module, specifically IndexFiles.java and SearchFiles.java. And then send questions to Lucene's users list (java-user@lucene.apache.org).