RAM has become very affordable recently, so for high-traffic sites the performance gains from holding the entire index in RAM should quickly pay for the up-front hardware cost.
The obvious approach is to load the index into Lucene's RAMDirectory.
Unfortunately, this class is known to put a heavy load on the garbage collector (GC): each file is naively held as a list of small
byte fragments (there are open Jira issues to
address this but they haven't been committed yet). It also has unnecessary
synchronization. If the application is updating the index (not just
searching), another challenge is how to persist ongoing changes from
RAMDirectory back to disk. Startup is also much slower,
as the index must first be loaded into RAM. Given these problems,
Lucene developers generally recommend using RAMDirectory
only for small indices or for testing purposes, and otherwise trusting
the operating system to manage RAM by using MMapDirectory (see this
excellent post for more details).
While there are open issues to improve RAMDirectory,
they haven't been committed, and many users simply use RAMDirectory anyway.
Recently I heard about the Zing JVM, from Azul, which provides a pauseless garbage collector even for very large heaps. In theory the high GC load of
RAMDirectory should not be a problem for
Zing. Let's test it! But first, a quick digression on the importance
of measuring search response time of all requests.
Search response time percentiles
Normally our performance testing scripts (luceneutil) measure the average response time, discarding outliers. We do this because we are typically testing an algorithmic change and want to see the effect of that change, ignoring confounding effects due to unrelated pauses from the OS, IO systems, GC, etc.
But for a real search application what matters is the total response time of every search request. Search is a fundamentally interactive process: the user sits and waits for the results to come back and then iterates from there. If even 1% of the searches take too long to respond that's a serious problem! Users are impatient and will quickly move on to your competitors.
So I made some improvements to luceneutil, including separating out a load testing client (sendTasks.py) that records the response time of all queries, as well as scripts (loadGraph.py and responseTimeGraph.py) to generate the resulting response-time graphs, and a top-level responseTimeTests.py to run a series of tests at increasing loads (queries/sec), automatically stopping once the load is clearly beyond the server's total capacity. As a nice side effect, this also determines the true capacity (max throughput) of the server rather than requiring an approximate measure by extrapolating from the average search latency.
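To illustrate why recording every response time matters, here is a minimal sketch of a nearest-rank percentile report. It is not the code in responseTimeGraph.py; the `percentile` helper and the sample latencies are hypothetical and chosen only to show how an average can hide a slow tail.

```python
import math

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]

# Hypothetical data: 98 fast queries plus 2 that were caught in a long pause.
samples = [20.0] * 98 + [4500.0] * 2

mean = sum(samples) / len(samples)
report = {p: percentile(samples, p) for p in (50, 95, 99, 100)}
print("mean ms:", mean)
print("percentiles ms:", report)
```

In this made-up sample the median and 95th percentile look healthy while the 99th percentile exposes the pause, which is the kind of effect an average with outliers discarded would miss.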
Queries are sent according to a Poisson distribution, to better model the arrival times of real searches, and the client is thread-less so that if you are testing at 200 queries/sec and the server suddenly pauses for 5 seconds then there will be 1000 queries queued up once it wakes up again (this fixes an unfortunately common bug in load testers that dedicate one thread per simulated client).
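The idea of a send schedule that does not depend on the server can be sketched as follows. This is not the actual sendTasks.py code; the function names, the seed, and the 60-second duration are assumptions made for illustration. Inter-arrival gaps are drawn from an exponential distribution, which yields Poisson arrivals, and because the schedule is fixed up front a server pause cannot slow down the sender.

```python
import random

def poisson_arrival_times(rate_per_sec, duration_sec, seed=0):
    """Return increasing send times (seconds from start) for a Poisson arrival process."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    while True:
        # Gaps between arrivals in a Poisson process are exponentially distributed.
        t += rng.expovariate(rate_per_sec)
        if t >= duration_sec:
            return times
        times.append(t)

def backlog_after_pause(send_times, pause_start, pause_len):
    """Count queries sent while the server is paused; all of them queue up."""
    pause_end = pause_start + pause_len
    return sum(1 for t in send_times if pause_start <= t < pause_end)

# 200 queries/sec for one minute, with a hypothetical 5 second server pause at t=20s.
schedule = poisson_arrival_times(rate_per_sec=200, duration_sec=60)
queued = backlog_after_pause(schedule, pause_start=20.0, pause_len=5.0)
print(len(schedule), "queries scheduled;", queued, "queued during the pause")
```

The backlog comes out close to 200 x 5 = 1000 queries. A tester that dedicates one blocked thread per simulated client would instead stop sending during the pause and under-report its impact.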
The client can run (via password-less ssh) on a separate machine; this is important because if the server machine itself (not just the JVM) is experiencing system-wide pauses (e.g. due to heavy swapping) it can cause pauses in the load testing client which will skew the results. Ideally the client runs on an otherwise idle machine and experiences no pauses. The client even disables Python's cyclic garbage collector to further reduce the chance of pauses.
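Disabling the cyclic collector in a Python client takes one call to the standard `gc` module. The sketch below is only an illustration: `timed_send` and the no-op request are hypothetical stand-ins, not part of luceneutil.

```python
import gc
import time

# Turn off automatic cyclic garbage collection. Reference counting still
# frees most objects immediately, so only reference cycles accumulate.
gc.disable()

def timed_send(send):
    """Call send() and return its elapsed wall-clock time in milliseconds."""
    start = time.perf_counter()
    send()
    return (time.perf_counter() - start) * 1000.0

# Hypothetical no-op standing in for a real search request.
elapsed_ms = timed_send(lambda: None)
print("gc enabled:", gc.isenabled(), "elapsed ms:", elapsed_ms)
```

This trades some memory growth for more predictable timing, which is usually acceptable for a load-test client that runs for a bounded time.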
To test Zing, I first indexed the full Wikipedia English database (as of 5/2/2012), totalling 28.78 GB of plain text across 33.3 M documents of about 1 KB each, including stored fields and term vectors so I could highlight the hits. The
resulting index was 78 GB. For each test, the server loads the entire
index into RAMDirectory and then the client sends the top 500
hardest (highest document frequency) term queries,
including stop words, to the server. At the same time, the server
re-indexes documents (via
updateDocument) at the rate of
100/sec (~100 KB/sec), and reopens the searcher once per second.
Each test ran for an hour, discarding results for the first 5 minutes to allow for warmup. Max heap was 140 GB (-Xmx140G). I also tested
MMapDirectory, with max heap of 4 GB, as a baseline. The
machine has 32 cores (64 with hyper-threading) and 512 GB of RAM, and
the server ran with 40 threads.
I tested at varying load rates (queries/sec), from 50 (very easy) to 275 (too hard), and then plotted the resulting response time percentiles for different configurations. The default Oracle GC (Parallel) was clearly horribly slow (tens of seconds of collection time), so I didn't include it. The experimental garbage first (G1) collector was even slower starting up (it took 6 hours to load the index into
RAMDirectory, vs 900 seconds for Concurrent
Mark/Sweep (CMS)), and then the queries hit > 100 second latencies, so
I also left it out (this was surprising as G1 is targeted towards
large heaps). The three configurations I did test were CMS at its
default settings with
MMapDirectory as the baseline,
and both CMS and Zing with RAMDirectory.
At the lightest load (50 QPS), Zing does a good job maintaining low worst-case response time, while CMS shows long worst-case response times, even with MMapDirectory:
To see the net capacity of each configuration, I plotted the 99% response time, across different load rates:
From this it's clear that the peak throughput for
MMapDirectory was somewhere between 100 and 150 queries/sec, while the
RAMDirectory-based indices were somewhere between 225
and 250 queries/sec. This is an impressive performance gain! It's
also interesting because in separate tests I
usually only see minor gains when measuring average query latency.
Plotting the same graph, without
CMS + MMapDirectory and
removing the 275 queries/second point (since it's over-capacity):
Zing remains incredibly flat at the 99th percentile, while CMS has high response times already at 100 QPS. Here are the response times at 225 queries/sec, the highest near-capacity rate, for just CMS and Zing on RAMDirectory:
The pause times for CMS are worse than they were at 50 QPS: already at the 95th percentile the response times are too slow (4479 milliseconds).
It's clear from these tests that Zing really has a very low-pause garbage collector, even at high loads, while managing a 140 GB max heap with a 78 GB Lucene index loaded into RAMDirectory.
Furthermore, applications can expect a substantial increase in max
throughput (around 2X faster in this case) and need not fear
RAMDirectory even for large indices, if they can
run under Zing.
Note that Azul has just made the Zing JVM freely available to open-source developers, so now we all can run our own experiments, add Zing into the JVM rotation for our builds, etc.
Next I will test the new postings format
that holds all postings in RAM in simple arrays (no compression) for
fast search performance. It requires even more RAM than this test,
but gives even faster search performance!
Thank you to Azul for providing a beefy server machine and the Zing JVM to run these tests!