Sunday, July 29, 2012

Building a new Lucene postings format

As of 4.0 Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene's index formats.

A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.

The trickiest format to create is PostingsFormat, which provides read/write access to all postings (fields, terms, documents, frequencies, positions, offsets, payloads). Part of the challenge is that it has a large API surface area. But there are also complexities such as skipping, reuse, conditional use of different values in the enumeration (frequencies, positions, payloads, offsets), partial consumption of the enumeration, etc. These challenges unfortunately make it easy for bugs to sneak in, but an awesome way to ferret out all the bugs is to leverage Lucene's extensive randomized tests: run all tests with -Dtests.postingsformat=XXX (be sure to first register your new postings format). If your new postings format has a bug, tests will most likely fail.

However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others. My favorite is the innocently named TestBasics! Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net our tests are not a great fit for the early iterations when developing a new postings format.

I recently created a new postings format, BlockPostingsFormat, which will hopefully be more efficient than the Sep codec at using fixed int block encodings. I did this to support Han Jiang's Google Summer of Code project to add a useful int block postings format to Lucene.

So, I took the opportunity to address this problem of easier early-stage iterations while developing a new postings format by creating a new test, TestPostingsFormat. It has layers of testing (documents, +freqs, +positions, +payloads, +offsets) that you can incrementally enable as you iterate, as well as different test options (skipping or not, reuse or not, stop visiting documents and/or positions early, one or more threads, etc.). When you turn on verbose (-Dtests.verbose=true) the test prints clear details of everything it indexed and what exactly it's testing so a failure is easy to debug. I'm very happy with the results: I found this to be a much more productive way to create a new postings format.

The goal of this test is to be so thorough that if it passes with your posting format then all Lucene's tests should pass. If ever we find that's not the case then I consider that a bug in TestPostingsFormat! (Who tests the tester?)

If you find yourself creating a new postings format I strongly suggest using the new TestPostingsFormat during early development to get your postings format off the ground. Once it's passing, run all tests with your new postings format, and if something fails please let us know so we can fix TestPostingsFormat.

17 comments:

  1. what is Posting format ?
    Is it something related to the way string is sent to the searcher ??

    ReplyDelete
  2. The PostingsFormat controls how the inverted index (terms mapping to list of documents that contain that term, plus things like frequency, positions, offsets) is stored in the index.

    ReplyDelete
  3. Mike

    Lucene newbie here. Sorry if my question sounds silly.
    Still trying to understand the terminology related to Postings/PostingFormat/Codec. Are they all same?

    From what you said, if custom postingsformat controls how the data is stored in index, would it be right to say that for the same dataset being loaded into index, different postingformats have different performance times?

    What would be the best postingformat to use if speed is important.

    Also can you point me to documentation/example to build custom postingformat using java?

    Thanks
    John

    ReplyDelete
  4. Hi Anonymous,

    A Codec holds many formats: postings, term vectors, deletions, etc. PostingsFormat just covers how postings (= the inverted part of the index, i.e. fields, terms, docs, positions, freqs, offsets) are written.

    There are very different performance characteristics for each PostingsFormat ... I would say the best one is the current default one (BlockPostingsFormat): it's fast for frequent terms because it bulk encodes/decodes the integers.

    I don't think there's any good document that describes the steps in building a PostingsFormat. I would start by looking at the PostingsFormat impls in the Lucene sources? Typically one builds a PostingsBaseFormat, which plugs into the terms dictionary, because building a new terms dictionary is challenging ... the PostingsBaseFormat only needs to encode the docs/freqs/positions/offsets.

    ReplyDelete
  5. Thanks Mike. Lot clear to me now :)

    I see that Lucene 4 used Lucene40 Postings format by default. Would it be possible to use BlockPostingsFormat and still be on Lucene4 or do we need to upgrade to Lucene4.1?

    -John

    ReplyDelete
  6. Hi John,

    That's a good question ... I'm not sure? If any of the codec APIs changed (which is likely: they are marked experimental) then you'd need to upgrade lucene core as well. Try it and report back ;)

    ReplyDelete
  7. Some advice to all of those asking about what postings are: have a look at the online IR Book at http://www-nlp.stanford.edu/IR-book/ . The first few chapters will help you understand the theory behind postings and inverted index. After reading the IR Book, this blog post makes much more sense! :-) Thanks Mike.

    ReplyDelete
  8. Thanks Gora, the only IR book is a wonderful resource!

    ReplyDelete
  9. Is there documentation about how lucene works internally when a one does a query say "title:Foo" ? e.g. how the tii, tis, prx, fdt, etc.. lookuped and in what order to get the information.

    ReplyDelete
    Replies
    1. I read http://lucene.apache.org/core/3_0_3/fileformats.html but not sure how things work when a query arrives.

      Delete
    2. I don't think there's any documentation for this besides the source code itself; could you ask on the dev list (dev@lucene.apache.org)?

      Delete
  10. Does lucene 4.7 use Lucene41PostingsFormat for Postings.nextdoc() while executing the span queries? My requirement is to run multiple span queries like [cat dog]~2 on 2 TB of index and I am worried about the performance as I have to collect all the docs in results. Is there a better postings format to choose from like memory postings while using span queries in solr 4.7?? Does having termvectors help me in this regard?

    ReplyDelete
    Replies
    1. Yes, Lucene41PF is still the default, so it's used for all queries to iterate docs/positions. MemoryPF isn't really a good match: it's best for primary key fields (each term matches just one document). Term vectors are unrelated: they are not used during searching, and they are quite slow to retrieve per document.

      Delete
    2. Thanks, so is there a better postingsformat to choose from while using span queries in solr 4.6 or solr4.7? Since, span queries utilize all the parts of postings like terms, documents, frequencies, positions and offsets.

      When I am debugging the lucene 4.6 test cases for span queries, it is showing that for above nextdoc() call it is utilizing DirectPostingsFormat.

      Does having termPositions=true and termOffsets=true help?

      Delete
    3. Span queries are unfortunately very costly. Have you tried a sloppy PhraseQuery instead? They will still be costly but perhaps less-so.

      However, even better: do a simpler (non-positional) query and then use the new rescoring APIs to re-sort the top e.g. 500 hits with your positional queries.

      DirectPostingsFormat is a very RAM-heavy, but fast, postings format.

      You don't need termOffsets for span queries; that's typically used for highlighting with PostingsHighlighter.

      Delete
  11. I will convert and run all span queries as conjunction queries and then use the output as filter query for the original span query. That might reduce the scope of span query substantially.

    Thanks for the idea.

    ReplyDelete