Changing Bits: July 2012

Tuesday, July 31, 2012

Lucene index in RAM with Azul's Zing JVM

Google's entire index has been in RAM for at least 5 years now. Why not do the same with an Apache Lucene search index?

RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.

The obvious approach is to load the index into Lucene's RAMDirectory, right?

Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a List of byte[1024] fragments (there are open Jira issues to address this but they haven't been committed yet). It also has unnecessary synchronization. If the application is updating the index (not just searching), another challenge is how to persist ongoing changes from RAMDirectory back to disk. Startup is much slower as the index must first be loaded into RAM. Given these problems, Lucene developers generally recommend using RAMDirectory only for small indices or for testing purposes, and otherwise trusting the operating system to manage RAM by using MMapDirectory (see Uwe's excellent post for more details).

While there are open issues to improve RAMDirectory (LUCENE-4123 and LUCENE-3659), they haven't been committed and many users simply use RAMDirectory anyway.

Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of RAMDirectory should not be a problem for Zing. Let's test it! But first, a quick digression on the importance of measuring search response time of all requests.

Search response time percentiles

Normally our performance testing scripts (luceneutil) measure the average response time, discarding outliers. We do this because we are typically testing an algorithmic change and want to see the effect of that change, ignoring confounding effects due to unrelated pauses from the the OS, IO systems, or GC, etc.

But for a real search application what matters is the total response time of every search request. Search is a fundamentally interactive process: the user sits and waits for the results to come back and then iterates from there. If even 1% of the searches take too long to respond that's a serious problem! Users are impatient and will quickly move on to your competitors.

So I made some improvements to luceneutil, including separating out a load testing client (sendTasks.py) that records the response time of all queries, as well as scripts (loadGraph.py and responseTimeGraph.py) to generate the resulting response-time graphs, and a top-level responseTimeTests.py to run a series of tests at increasing loads (queries/sec), automatically stopping once the load is clearly beyond the server's total capacity. As a nice side effect, this also determines the true capacity (max throughput) of the server rather than requiring an approximate measure by extrapolating from the average search latency.

Queries are sent according to a Poisson distribution, to better model the arrival times of real searches, and the client is thread-less so that if you are testing at 200 queries/sec and the server suddenly pauses for 5 seconds then there will be 1000 queries queued up once it wakes up again (this fixes an unfortunately common bug in load testers that dedicate one thread per simulated client).

The client can run (via password-less ssh) on a separate machine; this is important because if the server machine itself (not just the JVM) is experiencing system-wide pauses (e.g. due to heavy swapping) it can cause pauses in the load testing client which will skew the results. Ideally the client runs on an otherwise idle machine and experiences no pauses. The client even disables Python's cyclic garbage collector to further reduce the chance of pauses.

Wikipedia tests

To test Zing, I first indexed the full Wikipedia English database (as of 5/2/2012), totalling 28.78 GB plain text across 33.3 M 1 KB sized documents, including stored fields and term vectors, so I could highlight the hits using FastVectorHighlighter. The resulting index was 78 GB. For each test, the server loads the entire index into RAMDirectory and then the client sends the top 500 hardest (highest document frequency) TermQuerys, including stop words, to the server. At the same time, the server re-indexes documents (updateDocument) at the rate of 100/sec (~100 KB/sec), and reopens the searcher once per second.

Each test ran for an hour, discarding results for the first 5 minutes to allow for warmup. Max heap is 140 GB (-Xmx 140G). I also tested MMapDirectory, with max heap of 4 GB, as a baseline. The machine has 32 cores (64 with hyper-threading) and 512 GB of RAM, and the server ran with 40 threads.

I tested at varying load rates (queries/sec), from 50 (very easy) to 275 (too hard), and then plotted the resulting response time percentiles for different configurations. The default Oracle GC (Parallel) was clearly horribly slow (10s of seconds collection time) so I didn't include it. The experimental garbage first (G1) collector was even slower starting up (took 6 hours to load the index into RAMDirectory, vs 900 seconds for Concurrent Mark/Sweep (CMS)), and then the queries hit > 100 second latencies, so I also left it out (this was surprising as G1 is targeted towards large heaps). The three configurations I did test were CMS at its defaults settings, with MMapDirectory as the baseline, and both CMS and Zing with RAMDirectory.

At the lightest load (50 QPS), Zing does a good job maintaining low worst-case response time, while CMS shows long worst case response times, even with MMapDirectory:

To see the net capacity of each configuration, I plotted the 99% response time, across different load rates:

From this it's clear that the peak throughput for

CMS +
MMap

was somewhere between 100 and 150 queries/sec, while the RAMDirectory based indices were somewhere between 225 and 250 queries/second. This is an impressive performance gain! It's also interesting because in separately testing RAMDirectory vs MMapDirectory I usually only see minor gains when measuring average query latency.

Plotting the same graph, without CMS + MMapDirectory and removing the 275 queries/second point (since it's over-capacity):

Zing remains incredibly flat at the 99% percentile, while CMS has high response times already at 100 QPS. At 225 queries/sec load, the highest near-capacity rate, for just CMS and Zing on RAMDirectory:

The pause times for CMS are worse than they were at 50 QPS: already at the 95% percentile the response times are too slow (4479 milli-seconds).

Zing works!

It's clear from these tests that Zing really has a very low-pause garbage collector, even at high loads, while managing a 140 GB max heap with 78 GB Lucene index loaded into RAMDirectory. Furthermore, applications can expect a substantial increase in max throughput (around 2X faster in this case) and need not fear using RAMDirectory even for large indices, if they can run under Zing.

Note that Azul has just made the Zing JVM freely available to open-source developers, so now we all can run our own experiments, add Zing into the JVM rotation for our builds, etc.

Next I will test the new DirectPostingsFormat which holds all postings in RAM in simple arrays (no compression) for fast search performance. It requires even more RAM than this test, but gives even faster search performance!

Thank you to Azul for providing a beefy server machine and the Zing JVM to run these tests!

Sunday, July 29, 2012

Building a new Lucene postings format

As of 4.0 Lucene has switched to a new pluggable codec architecture, giving the application full control over the on-disk format of all index files. We have a nice collection of builtin codec components, and developers can create their own such as this recent example using a Redis back-end to hold updatable fields. This is an important change since it removes the previous sizable barriers to innovating on Lucene's index formats.

A codec is actually a collection of formats, one for each part of the index. For example, StoredFieldsFormat handles stored fields, NormsFormat handles norms, etc. There are eight formats in total, and a codec could simply be a new mix of pre-existing formats, or perhaps you create your own TermVectorsFormat and otherwise use all the formats from the Lucene40 codec, for example.

The trickiest format to create is PostingsFormat, which provides read/write access to all postings (fields, terms, documents, frequencies, positions, offsets, payloads). Part of the challenge is that it has a large API surface area. But there are also complexities such as skipping, reuse, conditional use of different values in the enumeration (frequencies, positions, payloads, offsets), partial consumption of the enumeration, etc. These challenges unfortunately make it easy for bugs to sneak in, but an awesome way to ferret out all the bugs is to leverage Lucene's extensive randomized tests: run all tests with -Dtests.postingsformat=XXX (be sure to first register your new postings format). If your new postings format has a bug, tests will most likely fail.

However, when a test does fail, it's a lot of work to dig into the specific failure to understand what went wrong, and some tests are more challenging than others. My favorite is the innocently named TestBasics! Furthermore, it would be nice to develop the postings format iteratively: first get only documents working, then add freqs, positions, payloads, offsets, etc. Yet we have no way to run only the subset of tests that don't require positions, for example. So today you have to code up everything before iterating. Net/net our tests are not a great fit for the early iterations when developing a new postings format.

I recently created a new postings format, BlockPostingsFormat, which will hopefully be more efficient than the Sep codec at using fixed int block encodings. I did this to support Han Jiang's Google Summer of Code project to add a useful int block postings format to Lucene.

So, I took the opportunity to address this problem of easier early-stage iterations while developing a new postings format by creating a new test, TestPostingsFormat. It has layers of testing (documents, +freqs, +positions, +payloads, +offsets) that you can incrementally enable as you iterate, as well as different test options (skipping or not, reuse or not, stop visiting documents and/or positions early, one or more threads, etc.). When you turn on verbose (-Dtests.verbose=true) the test prints clear details of everything it indexed and what exactly it's testing so a failure is easy to debug. I'm very happy with the results: I found this to be a much more productive way to create a new postings format.

The goal of this test is to be so thorough that if it passes with your posting format then all Lucene's tests should pass. If ever we find that's not the case then I consider that a bug in TestPostingsFormat! (Who tests the tester?)

If you find yourself creating a new postings format I strongly suggest using the new TestPostingsFormat during early development to get your postings format off the ground. Once it's passing, run all tests with your new postings format, and if something fails please let us know so we can fix TestPostingsFormat.

Tuesday, July 3, 2012

Lucene 4.0.0 alpha, at long last!

The 4.0.0 alpha release of Lucene and Solr is finally out!

This is a major release with lots of great changes. Here I briefly describe the most important Lucene changes, but first the basics:

All deprecated APIs as of 3.6.0 have been removed.
Pre-3.0 indices are no longer supported.
MIGRATE.txt describes how to update your application code.
The index format won't change (unless a serious bug fix requires it) between this release and 4.0 GA, but APIs may still change before 4.0.0 beta.

Please try the release and report back!

Pluggable Codec

The biggest change is the new pluggable Codec architecture, which provides full control over how all elements (terms, postings, stored fields, term vectors, deleted documents, segment infos, field infos) of the index are written. You can create your own or use one of the provided codecs, and you can customize the postings format on a per-field basis.

There are some fun core codecs:

Lucene40 is the default codec.
Lucene3x (read-only) reads any index written with Lucene 3.x.
SimpleText stores everything in plain text files (great for learning and debugging, but awful for production!).
MemoryPostingsFormat stores all postings (terms, documents, positions, offsets) in RAM as a fast and compact FST, useful for fields with limited postings (primary key (id) field, date field, etc.)
PulsingPostingsFormat inlines postings for low-frequency terms directly into the terms dictionary, saving a disk seek on lookup.
AppendingCodec avoids seeking while writing, necessary for file-systems such as Hadoop DFS.

If you create your own Codec it's easy to confirm all of Lucene/Solr's tests pass with it. If tests fail then likely your Codec has a bug!

A new 4-dimensional postings API (to read fields, terms, documents, positions) replaces the previous postings API.

Flexible scoring

Lucene's scoring is now fully pluggable, with the TF/IDF vector space model remaining as the default. You can create your own scoring model, or use one of the core scoring models (BM25, Divergence from Randomness, Language Models, and Information-based models). Per-document normalization values are no longer limited to a single byte. Various new aggregate statistics are now available.

These changes were part of a 2011 Google Summer of Code project (thank you David!).

These two changes are really important because they remove the barriers to ongoing innovations. Now it's easy to experiment with wild changes to the index format or to Lucene's scoring models. A recent example of such innovation is this neat codec by the devs at Flax to enable updatable fields by storing postings in a Redis key/value store.

Document Values

The new document values API stores strongly typed single-valued fields per document, meant as an eventual replacement for Lucene's field cache. The values are pre-computed during indexing and stored in the index in a column-stride format (values for a single field across all documents are stored together), making it much faster to initialize at search time than the field cache. Values can be fixed 8, 16, 32, 64 bit ints, or variable-bits sized (packed) ints; float or double; and six flavors of byte[] (fixed size or variable sized; dereferenced, straight or sorted).

New Field APIs

The API for creating document fields has changed: Fieldable and AbstractField have been removed, and a new FieldType, factored out of Field class, holds details about how the field's value should be indexed. New classes have been created for specific commonly-used fields:

StringField indexes a string as a single token, without norms and as docs only. For example, use this for a primary key (id) field, or for a field you will sort on.
TextField indexes the fully tokenized string, with norms and including docs, term frequencies and positions. For example, use this for the primary text field.
StoredField is a field whose value is just stored.
XXXDocValuesField create typed document values fields.
IntField, FloaField, LongField, DoubleField create typed numeric fields for efficient range queries and filters.

If none of these field classes apply, you can always create your own FieldType (typically by starting from the exposed FieldTypes from the above classes and then tweaking), and then construct a Field by passing the name, FieldType and value.

Note that the old APIs (using Index, Store, TermVector enums) are still present (deprecated), to ease migration.

These changes were part of a 2011 Google Summer of Code project (thank you Nikola!).

Other big changes

Lucene's terms are now binary (arbitrary byte[]); by default they are UTF-8 encoded strings, sorted in Unicode sort order. But your Analyzer is free to produce tokens with an arbitrary byte[] (e.g., CollationKeyAnalyzer does so).

A new DirectSpellChecker finds suggestions directly from any Lucene index, avoiding the hassle of maintaining a sidecar spellchecker index. It uses the same fast Levenshtein automata as FuzzyQuery (see below).

Term offsets (the start and end character position of each term) may now be stored in the postings, by using FieldInfo.IndexOption.DOCS_AND_POSITIONS_AND_OFFSETS when indexing the field. I expect this will be useful for fast highlighting without requiring term vectors, but this part is not yet done (patches welcome!).

A new AutomatonQuery matches all documents containing any term matching a provided automaton. Both WildcardQuery and RegexpQuery simply construct the corresponding automaton and then run AutomatonQuery. The classic QueryParser produces a RegexpQuery if you type fieldName:/expression/ or

/expression against
default field/

.

Optimizations

Beyond the fun new features there are some incredible performance gains.

If you use FuzzyQuery, you should see a factor of 100-200 speedup on moderately sized indices.

If you search with a Filter, you can see gains up to 3X faster (depending on filter density and query complexity), thanks to a change that applies filters just like we apply deleted documents.

If you use multiple threads for indexing, you should see stunning throughput gains (265% in that case), thanks to concurrent flushing. You are also now able to use more than 2048 MB IndexWriter RAM buffer (as long as you use multiple threads).

The new BlockTree default terms dictionary uses far less RAM to hold the terms index, and can sometimes avoid going to disk for terms that do not exist. In addition, the field cache also uses substantially less RAM, by avoiding separate objects per document and instead packing character data into shared byte[] blocks. Together this results in a 73% reduction in RAM required for searching in one case.

IndexWriter now buffers term data using byte[] instead of char[], using half the RAM for ASCII terms.

MultiTermQuery now rewrites per-segment, and caches per-term metadata to avoid a second lookup during scoring. This should improve performance though it hasn't been directly tested.

If a BooleanQuery consist of only MUST TermQuery clauses, then a specialized ConjunctionTermScorer is used, giving ~25% speedup.

Reducing merge IO impact

Merging (consolidating many small segments into a single big one) is a very IO and CPU intensive operation which can easily interfere with ongoing searches. In 4.0.0 we now have two ways to reduce this impact:

Rate-limit the IO caused by ongoing merging, by calling FSDirectory.setMaxMergeWriteMBPerSec.
Use the new NativeUnixDirectory which bypasses the OS's IO cache for all merge IO, by using direct IO. This ensures that a merge won't evict hot pages used by searches. (Note that there is also a native WindowsDirectory, but it does not yet use direct IO during merging... patches welcome!).

Remember to also set swappiness to 0 on Linux if you want to maximize search responsiveness.

More generally, the APIs that open an input or output file (Directory.openInput and Directory.createOutput) now take an IOContext describing what's being done (e.g., flush vs merge), so you can create a custom Directory that changes its behavior depending on the context.

These changes were part of a 2011 Google Summer of Code project (thank you Varun!).

Consolidated modules

The diverse sources, previously scattered between Lucene's and Solr's core and contrib, have been consolidated. Especially noteworthy is the analysis module, providing a rich selection of 48 analyzers across many languages; the queries module, containing function queries (the old core function queries have been removed) and other non-core query classes; and the queryparser module, with numerous query parsers including the classic QueryParser (moved from core).

Other changes

Here's a long list of additional changes:

The classic QueryParser now interprets term~N where N is an integer >= 1 as a FuzzyQuery with edit distance N.
The field cache normally requires single-valued fields, but we've added FieldCache.getDocTermsOrd which can handle multi-valued fields.
Analyzers must always provide a reusable token stream, by implementing the Analyzer.createComponents method (reusableTokenStream has been removed and tokenStream is now final, in Analzyer).
IndexReaders are now read-only (you cannot delete document by id, nor change norms) and are strongly typed as AtomicIndexReader or CompositeIndexReader
The API for reading term vectors is the same API used for reading all postings, except the term vector API only covers a single document. This is a good match because term vectors are really just a single-document inverted index.
Positional queries (PhraseQuery, SpanQuery) will now throw an exception if you run them against a field that did not index positions (previously they silently returned 0 hits).
String-based field cache APIs have been replaced with BytesRef based APIs.
ParallelMultiSearcher has been absorbed into IndexSearcher as an optional ExecutorService argument to the constructor. Searcher and Searchable have been removed.
All serialization code has been removed from Lucene's classes; you must handle serialization at a higher level in your application.
Field names are no longer interned, so you cannot rely on == to test for equality (use .equals instead).
*SpanFilter has been removed: they created too many objects during searching and were not scalable.
Removed IndexSearcher.close: IndexSearcher now only takes a provided IndexReader (no longer a Directory), which is the caller's responsibility to close.
You cannot put foreign files into the index directory anymore: they will be deleted by IndexWriter.
FieldSelector (to only load certain stored fields) has been replaced with a simpler StoredFieldVisitor API.

Happy searching!