Changing Bits: Primary key lookups are 2.8X faster with MemoryCodec

Thursday, June 30, 2011

Primary key lookups are 2.8X faster with MemoryCodec

A few days ago I committed the new MemoryCodec to Lucene's trunk (to be 4.0). This codec indexes all terms and postings into a compact finite-state transducer (FST) and then, at search time, avoids I/O by performing all terms and postings enumerations in memory using the FST.

If your application needs fast primary-key lookups, and you can afford the required additional memory, this codec might be a good match for the id field. To test this, I switched Lucene's nightly benchmark to use MemoryCodec (just for its id field), and performance jumped from around 179 K to 509 K lookups per second:

This is an awesome improvement! It's particularly impressive as the id field was previously indexed using PulsingCodec, which was already faster than the default StandardCodec.

This is the performance for a single thread, and should scale up linearly if you use multiple threads. Each lookup resolves 4,000 keys in order at once from the id field, performing the lookups segment by segment for best performance (see the source code). The index has 27.6 M docs across multiple segments.

Of course, there is added memory required, specifically 188 MB for this index, which works out to 7.1 bytes per document on average.

There are two sources of MemoryCodec's gains. First, the obvious one: since everything is in memory, you never wait for an I/O seek operation, as long as you ensure the sneaky OS never swaps out your process memory.

Second, I separately added a new seekExact API to TermsEnum, enabling codecs to save CPU if the caller does not need to know the following term when the target term doesn't exist, as is the case here. MemoryCodec has an optimized implementation for seekExact (and so does the cool SimpleTextCodec!). Eventually other codecs should as well, by using the block tree terms index, but we're not there yet.

The id field in the nightly benchmark omits term freq and positions, however MemoryCodec is fully general: you can use it for any field (not just primary-key), storing positions, payloads, etc. Also, its values are zero-padded sequential integers (00000001, 00000002, 00000003, etc.), which is likely important for performance as it allows maximal sharing in the FST. I haven't tested but I suspect had I used something more random, such as GUIDs, memory usage would be higher and lookup performance worse as each segment's FST would be less dense (share less).

Of course, Lucene is not a database, and you normally use it for its fast search performance, not primary-key lookups. The one common search use case where you do require primary-key lookups is during indexing, when deleting or updating documents by an id field. Near-realtime search with updates or deletions relies on this, since the deleted documents must be resolved during reopen, so we also see a healthy speedup in the NRT reopen time:

The NRT latencey dropped from around 52 milliseconds to 43 milliseconds, a 17% improvement. This is "only" 17% because opening a new reader must also do other things like flush the indexed documents as a new segment.

Perhaps more importantly, the variance also dropped substantially, which is expected because with MemoryCodec and NRTCachingDirectory, NRT reopen is fully I/O free (performs no reads or writes when opening a new reader).

One limitation of MemoryCodec is it's an all-or-nothing deal: all terms and postings are in memory, or they aren't. LUCENE-3069, still to be done (any volunteers?), aims to fix this, by enabling you to separately choose whether terms and/or postings data should be in memory.

I suspect an even more specialized codec, for example one that requires the field values to be compact integers, and also requires that the values are unique (only supports primary-key fields), could do even better than MemoryCodec by storing the mapping in global (across all segments) parallel arrays. Such a codec would no longer be general; it'd only work for primary-key fields whose values are compact integers. But it'd have faster lookups than MemoryCodec and should use less memory per document. This codec could simply wrap any other codec, i.e. it would create the arrays on reader initialization, and delegate persisting the postings into the index to the wrapped codec.

14 comments:

Eks DevAugust 3, 2011 at 1:19 PM
Great blog Mike,
this gets better and better :)

Re. "or-or-nothing deal". It is not only about postings, but Fields as well (or even terms if we go crazy)... If memory is scarce, caching 5% top Terms, brings 90% speed-up...

In general, to put all this 4.0 flexibility to good use, some smart configuration possibilities have to be added, e.g.

- What exactly gets loaded into ram (fields/terms/postings & co)
- Which codec is used for what (Pulsing for uniqueID field, the rest with *BlockCodec). Dense postings with X, sparse with Y.
- How this "in memory part" can be reloaded (master slave setup, where updates to index being searched happen externally)
- not to mention column stride fields...

...progress, not perfection as you preach

regards,
eks dev
ReplyDelete
Replies
AnonymousJanuary 22, 2014 at 1:02 AM
Hey Mike,
Can you please tell me what are Doc Values and Their Uses?
ReplyDelete
Replies
AnonymousJanuary 27, 2014 at 12:18 AM
Hey Mike,
Is there any way to use MemoryPosting Format for my whole index, I mean instead of specifying Codec attribute on per field basis can I specify something in my solrconfig.xml or schema.xml that will use Memory Posting Format for all indexed fields.If there is any way let me know. I know this will take lot of my memory but in my case memory issue is not there all i want is to return results as fast as possible.
Thanks.
ReplyDelete
Replies
AnonymousJanuary 31, 2014 at 12:22 AM
Hey Mike,
I want to use this codec in my schema.I am using Solr 4.2 version and I am confuse whether the jars for this codec is already included in the package or do we have to build it from the trunk.
OR
I have given postingsFormat="Memory" as the attribute for the fieldtype but was not getting any error so this means that my field is using this codec?
ReplyDelete
Replies
AnonymousMarch 5, 2014 at 9:39 AM
Hey Mike,
I am using this format on a text field(not primary field it is a data field) in core A and other core B is not using this format .
I am running around 30000 queries 2000 per minute to test this format and I found that core A average query time is more than core B which is not using this format.
So,my question is Is this format only useful for primary key Or for other fields also?
If not for other fields what might be the reason?
ReplyDelete
Replies
GaneshSeptember 29, 2014 at 5:22 PM
We tried using postingFormat=Memory for one of key in solr index (though not 100% unique key, it does not repeat in more than in 3 docs) along with class="solr.StrField" postingsFormat="Memory" sortMissingLast="true" omitNorms="true" field type param in solr 4.6.1. The data will be numeric though fiedtype is declared as string. We are seeing degradation (2x) after changing it to 'Memory' Posting format. I had expected the performance to improve, as per your post.

Our index is about 15M docs and distributed. We see that have free -g showing around 60+gb free. Also the test was done after the index was warmed up as well.
Any idea what could have gone wrong?
ReplyDelete
Replies
GaneshOctober 6, 2014 at 2:27 PM
Also there is new ID Posting format https://issues.apache.org/jira/browse/LUCENE-5675. does it supersede Memory posting format? How does these two compare.

Can ID posting format be used from Solr?
ReplyDelete
Replies

Add comment