Thursday, June 30, 2011

Primary key lookups are 2.8X faster with MemoryCodec

A few days ago I committed the new MemoryCodec to Lucene's trunk (to be 4.0). This codec indexes all terms and postings into a compact finite-state transducer (FST) and then, at search time, avoids I/O by performing all terms and postings enumerations in memory using the FST.

If your application needs fast primary-key lookups, and you can afford the required additional memory, this codec might be a good match for the id field. To test this, I switched Lucene's nightly benchmark to use MemoryCodec (just for its id field), and performance jumped from around 179 K to 509 K lookups per second:



This is an awesome improvement! It's particularly impressive as the id field was previously indexed using PulsingCodec, which was already faster than the default StandardCodec.

This is the performance for a single thread, and should scale up linearly if you use multiple threads. Each lookup resolves 4,000 keys in order at once from the id field, performing the lookups segment by segment for best performance (see the source code). The index has 27.6 M docs across multiple segments.

Of course, there is added memory required, specifically 188 MB for this index, which works out to 7.1 bytes per document on average.

There are two sources of MemoryCodec's gains. First, the obvious one: since everything is in memory, you never wait for an I/O seek operation, as long as you ensure the sneaky OS never swaps out your process memory.

Second, I separately added a new seekExact API to TermsEnum, enabling codecs to save CPU if the caller does not need to know the following term when the target term doesn't exist, as is the case here. MemoryCodec has an optimized implementation for seekExact (and so does the cool SimpleTextCodec!). Eventually other codecs should as well, by using the block tree terms index, but we're not there yet.

The id field in the nightly benchmark omits term freq and positions, however MemoryCodec is fully general: you can use it for any field (not just primary-key), storing positions, payloads, etc. Also, its values are zero-padded sequential integers (00000001, 00000002, 00000003, etc.), which is likely important for performance as it allows maximal sharing in the FST. I haven't tested but I suspect had I used something more random, such as GUIDs, memory usage would be higher and lookup performance worse as each segment's FST would be less dense (share less).

Of course, Lucene is not a database, and you normally use it for its fast search performance, not primary-key lookups. The one common search use case where you do require primary-key lookups is during indexing, when deleting or updating documents by an id field. Near-realtime search with updates or deletions relies on this, since the deleted documents must be resolved during reopen, so we also see a healthy speedup in the NRT reopen time:



The NRT latencey dropped from around 52 milliseconds to 43 milliseconds, a 17% improvement. This is "only" 17% because opening a new reader must also do other things like flush the indexed documents as a new segment.

Perhaps more importantly, the variance also dropped substantially, which is expected because with MemoryCodec and NRTCachingDirectory, NRT reopen is fully I/O free (performs no reads or writes when opening a new reader).

One limitation of MemoryCodec is it's an all-or-nothing deal: all terms and postings are in memory, or they aren't. LUCENE-3069, still to be done (any volunteers?), aims to fix this, by enabling you to separately choose whether terms and/or postings data should be in memory.

I suspect an even more specialized codec, for example one that requires the field values to be compact integers, and also requires that the values are unique (only supports primary-key fields), could do even better than MemoryCodec by storing the mapping in global (across all segments) parallel arrays. Such a codec would no longer be general; it'd only work for primary-key fields whose values are compact integers. But it'd have faster lookups than MemoryCodec and should use less memory per document. This codec could simply wrap any other codec, i.e. it would create the arrays on reader initialization, and delegate persisting the postings into the index to the wrapped codec.

10 comments:

  1. Great blog Mike,
    this gets better and better :)

    Re. "or-or-nothing deal". It is not only about postings, but Fields as well (or even terms if we go crazy)... If memory is scarce, caching 5% top Terms, brings 90% speed-up...

    In general, to put all this 4.0 flexibility to good use, some smart configuration possibilities have to be added, e.g.

    - What exactly gets loaded into ram (fields/terms/postings & co)
    - Which codec is used for what (Pulsing for uniqueID field, the rest with *BlockCodec). Dense postings with X, sparse with Y.
    - How this "in memory part" can be reloaded (master slave setup, where updates to index being searched happen externally)
    - not to mention column stride fields...

    ...progress, not perfection as you preach

    regards,
    eks dev

    ReplyDelete
  2. Hey Mike,
    Can you please tell me what are Doc Values and Their Uses?

    ReplyDelete
    Replies
    1. Hi Anonymous,

      Doc values allow you to assign a typed (sorted, numeric, binary) value or values per field for each document; they are stored efficiently in "column stride" format so that all values for a single field across all documents are concatenated together. This is often used for scoring factors, e.g. Lucene's norms are in fact a doc values field under the hood.

      Delete
  3. Hey Mike,
    Is there any way to use MemoryPosting Format for my whole index, I mean instead of specifying Codec attribute on per field basis can I specify something in my solrconfig.xml or schema.xml that will use Memory Posting Format for all indexed fields.If there is any way let me know. I know this will take lot of my memory but in my case memory issue is not there all i want is to return results as fast as possible.
    Thanks.

    ReplyDelete
    Replies
    1. Hi Anonymous,

      Be careful: MemoryPF is best when the number of documents per term is very low (e.g., only 1 for the primary key case). For a more ordinary "body" field, the overhead when building the FST and in traversing it, and the lack of advance, will likely hurt you. Test it and see.

      I'm not sure how Solr lets you set MemoryPF for all fields; try asking on solr-user@lucene.apache.org?

      You might want to try DirectPostingsFormat: it uses tons of memory, since it uses simple Java int[] to hold postings, but it's also very, very fast.

      Or, on trunk, you could try the FSTPostingsFormat, which holds the entire terms dict in RAM, but leaves the postings on disk.

      Delete
  4. Hey Mike,
    I want to use this codec in my schema.I am using Solr 4.2 version and I am confuse whether the jars for this codec is already included in the package or do we have to build it from the trunk.
    OR
    I have given postingsFormat="Memory" as the attribute for the fieldtype but was not getting any error so this means that my field is using this codec?

    ReplyDelete
    Replies
    1. Hi Anonymous,

      Since my blog post, this format was renamed to MemoryPostingsFormat, so I suspect that adding postingsFormat="Memory" to your fieldType means you are successfully using it.

      Delete
  5. Hey Mike,
    I am using this format on a text field(not primary field it is a data field) in core A and other core B is not using this format .
    I am running around 30000 queries 2000 per minute to test this format and I found that core A average query time is more than core B which is not using this format.
    So,my question is Is this format only useful for primary key Or for other fields also?
    If not for other fields what might be the reason?

    ReplyDelete
    Replies
    1. In fact, MemoryPostingsFormat is a very bad choice for "normal" text fields: the way the postings are held in an FST makes it quite costly to iterate the postings. And there is no fast .advance() implementation (it's just a linear scan). Really you should only use this for fields where you know each term will occur in a tiny number of documents.

      Delete