Sunday, July 29, 2012

Building a new Lucene postings format

As of 4.0, Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of built-in codec components, and developers can create their own, such as this recent example using a Redis back-end to hold updatable fields. This is an important change because it removes the previously sizable barriers to innovating on Lucene's index formats.
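
To make that concrete, here is a minimal sketch (against the 4.0 APIs) of how an application opts into a custom codec at indexing time; MyCodec is a hypothetical Codec subclass and the index path is made up:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Every new segment written by this IndexWriter will use MyCodec:
    Directory dir = FSDirectory.open(new java.io.File("/path/to/index"));
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
                                                  new StandardAnalyzer(Version.LUCENE_40));
    iwc.setCodec(new MyCodec());
    IndexWriter writer = new IndexWriter(dir, iwc);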

A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.
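
For instance, a codec that swaps in its own term vectors format but delegates everything else to Lucene40 might look like the following sketch (it assumes the 4.x FilterCodec helper; MyTermVectorsFormat is a hypothetical TermVectorsFormat implementation):

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.TermVectorsFormat;

    // All other formats (postings, stored fields, norms, ...) are delegated to Lucene40.
    public final class MyCodec extends FilterCodec {
      public MyCodec() {
        super("MyCodec", Codec.forName("Lucene40"));
      }

      @Override
      public TermVectorsFormat termVectorsFormat() {
        return new MyTermVectorsFormat();  // hypothetical custom format
      }
    }

Codecs are located by name through Java's SPI, so a custom codec like this is also listed in a META-INF/services/org.apache.lucene.codecs.Codec file.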

The trickiest format to create is PostingsFormat, which provides read/write access to all postings (fields, terms, documents, frequencies, positions, offsets, payloads). Part of the challenge is that it has a large API surface area. But there are also complexities such as skipping, reuse, conditional use of different values in the enumeration (frequencies, positions, payloads, offsets), partial consumption of the enumeration, etc. These challenges unfortunately make it easy for bugs to sneak in, but an awesome way to ferret out all the bugs is to leverage Lucene's extensive randomized tests: run all tests with -Dtests.postingsformat=XXX (be sure to first register your new postings format). If your new postings format has a bug, tests will most likely fail.
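
Registering a postings format also happens through Java's SPI. Roughly (a sketch only; MyPostingsFormat and its package are hypothetical), the skeleton looks like this:

    import java.io.IOException;

    import org.apache.lucene.codecs.FieldsConsumer;
    import org.apache.lucene.codecs.FieldsProducer;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.index.SegmentReadState;
    import org.apache.lucene.index.SegmentWriteState;

    public final class MyPostingsFormat extends PostingsFormat {
      public MyPostingsFormat() {
        super("MyFormat");  // the name to pass to -Dtests.postingsformat=MyFormat
      }

      @Override
      public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
        // write side: fields, terms, docs, freqs, positions, payloads, offsets
        throw new UnsupportedOperationException("TODO");
      }

      @Override
      public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
        // read side
        throw new UnsupportedOperationException("TODO");
      }
    }

plus a META-INF/services/org.apache.lucene.codecs.PostingsFormat file listing the fully qualified class name, so SPI (and the tests) can find it.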

However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others. My favorite is the innocently named TestBasics! Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net our tests are not a great fit for the early iterations when developing a new postings format.

I recently created a new postings format, BlockPostingsFormat, which will hopefully be more efficient than the Sep codec at using fixed int block encodings. I did this to support Han Jiang's Google Summer of Code project to add a useful int block postings format to Lucene.

So, I took the opportunity to address this problem of easier early-stage iterations while developing a new postings format by creating a new test, TestPostingsFormat. It has layers of testing (documents, +freqs, +positions, +payloads, +offsets) that you can incrementally enable as you iterate, as well as different test options (skipping or not, reuse or not, stop visiting documents and/or positions early, one or more threads, etc.). When you turn on verbose (-Dtests.verbose=true) the test prints clear details of everything it indexed and what exactly it's testing so a failure is easy to debug. I'm very happy with the results: I found this to be a much more productive way to create a new postings format.

The goal of this test is to be so thorough that if it passes with your postings format then all of Lucene's tests should pass. If we ever find that's not the case, then I consider that a bug in TestPostingsFormat! (Who tests the tester?)

If you find yourself creating a new postings format I strongly suggest using the new TestPostingsFormat during early development to get your postings format off the ground. Once it's passing, run all tests with your new postings format, and if something fails please let us know so we can fix TestPostingsFormat.

22 comments:

  1. What is a postings format?
    Is it something related to the way the string is sent to the searcher?

  2. The PostingsFormat controls how the inverted index (terms mapping to list of documents that contain that term, plus things like frequency, positions, offsets) is stored in the index.

  3. Thanks Mike Sir for answering.

  4. Mike

    Lucene newbie here. Sorry if my question sounds silly.
    Still trying to understand the terminology around postings/PostingsFormat/codec. Are they all the same?

    From what you said, if a custom PostingsFormat controls how the data is stored in the index, would it be right to say that for the same dataset being loaded into the index, different postings formats have different performance?

    What would be the best postings format to use if speed is important?

    Also, can you point me to documentation or an example for building a custom postings format in Java?

    Thanks
    John

  5. Hi Anonymous,

    A Codec holds many formats: postings, term vectors, deletions, etc. PostingsFormat just covers how postings (= the inverted part of the index, i.e. fields, terms, docs, positions, freqs, offsets) are written.

    There are very different performance characteristics for each PostingsFormat ... I would say the best one is the current default one (BlockPostingsFormat): it's fast for frequent terms because it bulk encodes/decodes the integers.

    I don't think there's any good document that describes the steps in building a PostingsFormat. I would start by looking at the PostingsFormat impls in the Lucene sources? Typically one builds a PostingsBaseFormat, which plugs into the terms dictionary, because building a new terms dictionary is challenging ... the PostingsBaseFormat only needs to encode the docs/freqs/positions/offsets.
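
    Roughly, the wiring looks like the sketch below (it follows the pattern in the 4.x sources, though exact class names and constructor arguments vary by version; MyPostingsWriter is a hypothetical PostingsWriterBase implementation):

      // Inside your PostingsFormat: reuse the BlockTree terms dictionary and only
      // implement the piece that encodes docs/freqs/positions/offsets.
      @Override
      public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
        PostingsWriterBase postingsWriter = new MyPostingsWriter(state);  // hypothetical
        boolean success = false;
        try {
          FieldsConsumer ret = new BlockTreeTermsWriter(state, postingsWriter,
              BlockTreeTermsWriter.DEFAULT_MIN_BLOCK_SIZE,
              BlockTreeTermsWriter.DEFAULT_MAX_BLOCK_SIZE);
          success = true;
          return ret;
        } finally {
          if (!success) {
            IOUtils.closeWhileHandlingException(postingsWriter);  // don't leak open files
          }
        }
      }
      // fieldsProducer() mirrors this on the read side, pairing a PostingsReaderBase
      // with BlockTreeTermsReader.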

  6. Thanks Mike. A lot clearer to me now :)

    I see that Lucene 4.0 uses the Lucene40 postings format by default. Would it be possible to use BlockPostingsFormat and still be on Lucene 4.0, or do we need to upgrade to Lucene 4.1?

    -John

  7. Hi John,

    That's a good question ... I'm not sure? If any of the codec APIs changed (which is likely: they are marked experimental) then you'd need to upgrade Lucene core as well. Try it and report back ;)

  8. Some advice to all of those asking about what postings are: have a look at the online IR Book at http://www-nlp.stanford.edu/IR-book/. The first few chapters will help you understand the theory behind postings and the inverted index. After reading the IR Book, this blog post makes much more sense! :-) Thanks Mike.

  9. Thanks Gora, the online IR Book is a wonderful resource!

  10. Is there documentation about how Lucene works internally when one does a query, say "title:Foo"? E.g. how the tii, tis, prx, fdt, etc. files are looked up, and in what order, to get the information.

    Replies
    1. I read http://lucene.apache.org/core/3_0_3/fileformats.html but I'm still not sure how things work when a query arrives.

    2. I don't think there's any documentation for this besides the source code itself; could you ask on the dev list (dev@lucene.apache.org)?

  11. Does Lucene 4.7 use Lucene41PostingsFormat for the postings nextDoc() calls while executing span queries? My requirement is to run multiple span queries like [cat dog]~2 on a 2 TB index, and I am worried about performance since I have to collect all the matching docs. Is there a better postings format to choose, like the memory postings format, when using span queries in Solr 4.7? Does having term vectors help in this regard?

    Replies
    1. Yes, Lucene41PF is still the default, so it's used for all queries to iterate docs/positions. MemoryPF isn't really a good match: it's best for primary key fields (each term matches just one document). Term vectors are unrelated: they are not used during searching, and they are quite slow to retrieve per document.
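
      For what it's worth, here's a rough sketch of how you would route just a primary-key field to the Memory postings format (assuming Lucene 4.6/4.7; "id" is a hypothetical field name and iwc is your IndexWriterConfig):

        // Use the Memory postings format only for the primary-key field; everything
        // else keeps the default postings format.
        iwc.setCodec(new Lucene46Codec() {
          @Override
          public PostingsFormat getPostingsFormatForField(String field) {
            if ("id".equals(field)) {
              return PostingsFormat.forName("Memory");
            }
            return super.getPostingsFormatForField(field);
          }
        });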

    2. Thanks, so is there a better postings format to choose when using span queries in Solr 4.6 or 4.7? Span queries utilize all the parts of the postings: terms, documents, frequencies, positions and offsets.

      When I am debugging the Lucene 4.6 test cases for span queries, it shows that the above nextDoc() call is using DirectPostingsFormat.

      Does having termPositions=true and termOffsets=true help?

    3. Span queries are unfortunately very costly. Have you tried a sloppy PhraseQuery instead? They will still be costly, but perhaps less so (see the sketch at the end of this reply).

      However, even better: do a simpler (non-positional) query and then use the new rescoring APIs to re-sort the top e.g. 500 hits with your positional queries.

      DirectPostingsFormat is a very RAM-heavy, but fast, postings format.

      You don't need termOffsets for span queries; that's typically used for highlighting with PostingsHighlighter.
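
      Here is the sketch mentioned above: a rough illustration of both ideas, assuming Lucene 4.x, a field named "body", an existing IndexSearcher named searcher, and (for the second part) the QueryRescorer helper from the newer rescoring APIs added in later 4.x releases:

        // A sloppy phrase query, roughly the positional part of [cat dog]~2:
        PhraseQuery phrase = new PhraseQuery();
        phrase.setSlop(2);
        phrase.add(new Term("body", "cat"));
        phrase.add(new Term("body", "dog"));

        // Even better: run a cheaper, non-positional first pass, then re-sort only
        // the top 500 hits using the positional query:
        BooleanQuery firstPass = new BooleanQuery();
        firstPass.add(new TermQuery(new Term("body", "cat")), BooleanClause.Occur.MUST);
        firstPass.add(new TermQuery(new Term("body", "dog")), BooleanClause.Occur.MUST);
        TopDocs hits = searcher.search(firstPass, 500);
        TopDocs rescored = QueryRescorer.rescore(searcher, hits, phrase, 2.0, 500);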

  12. I will convert and run all span queries as conjunction queries and then use the output as a filter query for the original span query. That might reduce the scope of the span query substantially.

    Thanks for the idea.

  13. Hi Mike,
    We were trying to use a configurable PerFieldPostingsFormat to get an updateable field backed by a key-value store, something similar to what you have linked in your post (Flax). However, I doubt that an updateable field is possible using codecs, for the following reasons:

    Assume the key for this key-value store is segmentName_fieldName_term and the value is the postings array: say, just an array of docIds.

    Assume there is a field PK (an id kind of field) which is stored and uses the default codec.

    #1. The FieldsConsumer of a PostingsFormat is called only at flush time. With this design, handling an update for a document that is not yet flushed seems difficult for a field writing to a key-value store.
    So if I add a document with, say, PK:1 and then add another partial-update document for PK:1, then to apply the update we first need to figure out the segmentName for this document, and only then can we update it in the key-value store. But this doc has no segmentName as of now. Because of DWPTs we can't even assume that the document will end up in the same segment.

    Note: Since the custom PostingsFormat is only called at flush time, we write the postings data directly to the key-value store.

    #2: Merged segments are just checkpointed, not committed. So, for the custom TermsConsumer.merge() call we cannot write the new merged state to the key-value store directly. Note that merge() and flush() both use the same methods like startTerm(), startDoc(), finishTerm(), etc., and since in these methods the design is to write directly to the key-value store, we do not use the TermsConsumer.merge() function. We actually keep the new merged state in memory but have no good signal to flush it to the key-value store; we flush this in-memory merge state at the next flush. However, Lucene can commit the checkpointed merged segment without any document needing to be flushed. So we get into a state where the Lucene index directory has new merged segments but the key-value store is not yet updated with them. If we write the merged state directly, then we get into a state where the key-value store has the new merged segment but the Lucene directory does not.

    #3 - Dropped segments. Again, we can't handle this, as the only time our custom consumers can remove deleted docs is at merge time, but Lucene can drop segments during normal commits.

    #4 - Replication: We do take the index files and key-value store files for replication, but we cannot keep the two in sync.

    #5 - Applying updates to a reindexed document when the reindexed document is taken by a DWPT that is flushed after the partial-update document. Updates are lost in this case.

    Currently, I plan to try adding a SegmentInfosFormat (maybe a LiveDocsFormat too) and a Directory wrapper, but as of now I do not know whether that will solve all the problems.

    It would be great to know your thoughts on the feasibility of providing updateable fields using codecs.

    Replies
    1. Hi Aditya,

      Sorry for the slow response here; if you haven't already, could you ask this question on the Lucene users list instead? (java-user@lucene.apache.org).

    2. Hi Mike,
      Thanks a lot for your reply. I was able to solve some of the easier issues listed above. I will post a new question based on the remaining ones.

  14. Hello Mike,
    I'm starting to study Lucene, and I'm curious to see what the inverted index looks like after indexing some files.
    I would like to know what steps I should take and which classes to use.
    Thanks if you can respond.

    Replies
    1. Hi Igor, I suggest starting with Lucene's "demo" module, specifically IndexFiles.java and SearchFiles.java. And then send questions to Lucene's users list (java-user@lucene.apache.org).
