Google's entire index has been in RAM for at least 5 years now. Why
not do the same with an Apache Lucene search index?
RAM has become very affordable recently, so for high-traffic sites the
performance gains from holding the entire index in RAM should quickly
pay for the up-front hardware cost.
The obvious approach is to load the index into
Lucene's
RAMDirectory, right?
Unfortunately, this class is known to put a heavy load on the garbage
collector (GC): each file is naively held as a
List
of
byte[1024] fragments (there are open Jira issues to
address this but they haven't been committed yet). It also has unnecessary
synchronization. If the application is updating the index (not just
searching), another challenge is how to persist ongoing changes
from
RAMDirectory back to disk. Startup is much slower
as the index must first be loaded into RAM. Given these problems,
Lucene developers generally recommend using
RAMDirectory
only for small indices or for testing purposes, and otherwise trusting
the operating system to manage RAM by using
MMapDirectory
(see
Uwe's
excellent post for more details).
While there are open issues to improve
RAMDirectory
(
LUCENE-4123
and
LUCENE-3659),
they haven't been committed and many users simply
use
RAMDirectory anyway.
Recently I heard about the
Zing
JVM, from
Azul, which provides a
pauseless garbage collector even for very large heaps.
In theory the
high GC load of
RAMDirectory should not be a problem for
Zing. Let's test it! But first, a quick digression on the importance
of measuring search response time of all requests.
Search response time percentiles
Normally our performance testing scripts
(
luceneutil)
measure the
average response time, discarding outliers. We
do this because we are typically testing an algorithmic change and
want to see the effect of that change, ignoring confounding effects
due to unrelated pauses from the the OS, IO systems, or GC, etc.
But
for a real search application what matters is the total response time
of every search request. Search is a fundamentally interactive process:
the user sits and waits for the results to come back and then iterates
from there. If even 1% of the searches take too long
to respond that's a serious problem! Users are impatient and will quickly move on
to your competitors.
So I made some improvements to
luceneutil, including separating out a load testing
client
(
sendTasks.py)
that records the response time of all queries, as well as scripts
(
loadGraph.py
and
responseTimeGraph.py)
to generate the resulting response-time graphs, and a top-level
responseTimeTests.py
to run a series of tests at increasing loads (queries/sec),
automatically stopping once the load is clearly beyond the server's
total capacity. As a nice side effect, this also determines the true capacity (max
throughput) of the server rather than requiring an approximate measure
by extrapolating from the average search latency.
Queries are sent according to
a
Poisson
distribution, to better model the arrival times of real searches,
and the client is thread-less so that if you are testing at 200
queries/sec and the server suddenly pauses for 5 seconds then there
will be 1000 queries queued up once it wakes up again (this fixes an
unfortunately common bug in load testers that dedicate one thread per
simulated client).
The client can run (via password-less ssh) on a separate machine; this
is important because if the server machine itself (not just the JVM)
is experiencing system-wide pauses (e.g. due to heavy swapping) it can
cause pauses in the load testing client which will skew the results.
Ideally the client runs on an otherwise idle machine and experiences
no pauses. The client even disables
Python's
cyclic
garbage collector to further reduce the chance of pauses.
Wikipedia tests
To test Zing, I first indexed the full Wikipedia English database (as of
5/2/2012), totalling 28.78 GB plain text across 33.3 M 1 KB sized
documents, including stored fields and term vectors, so I could
highlight the hits using
FastVectorHighlighter. The
resulting index was 78 GB. For each test, the server loads the entire
index into
RAMDirectory and then the client sends the top 500
hardest (highest document frequency)
TermQuerys,
including stop words, to the server. At the same time, the server
re-indexes documents (
updateDocument) at the rate of
100/sec (~100 KB/sec), and reopens the searcher once per second.
Each test ran for an hour, discarding results for the first 5 minutes
to allow for warmup. Max heap is 140 GB (-Xmx 140G). I also tested
MMapDirectory, with max heap of 4 GB, as a baseline. The
machine has 32 cores (64 with hyper-threading) and 512 GB of RAM, and
the server ran with 40 threads.
I tested at varying load rates (queries/sec), from 50 (very easy) to
275 (too hard), and then plotted the resulting response time
percentiles for different configurations. The default Oracle GC
(Parallel) was clearly horribly slow (10s of seconds collection time)
so I didn't include it. The
experimental
garbage
first (G1) collector
was even slower starting up (took 6 hours to load the index
into
RAMDirectory, vs 900 seconds for Concurrent
Mark/Sweep (CMS)), and then the queries hit > 100 second latencies, so
I also left it out (this was surprising as G1 is targeted towards
large heaps). The three configurations I did test were CMS at its
defaults settings, with
MMapDirectory as the baseline,
and both CMS and Zing with
RAMDirectory.
At the lightest load (50 QPS), Zing does a good job maintaining low
worst-case response time, while CMS shows long worst case response
times, even with
MMapDirectory:
To see the net capacity of each configuration, I plotted the 99%
response time, across different load rates:
From this it's clear that the peak throughput for
CMS +
MMap was somewhere between 100 and 150 queries/sec, while
the
RAMDirectory based indices were somewhere between 225
and 250 queries/second. This is an impressive performance gain! It's
also interesting because in separately
testing
RAMDirectory vs
MMapDirectory I
usually only see minor gains when measuring average query latency.
Plotting the same graph, without
CMS + MMapDirectory and
removing the 275 queries/second point (since it's over-capacity):
Zing remains incredibly flat at the 99% percentile, while CMS has
high response times already at 100 QPS. At 225 queries/sec
load, the highest near-capacity rate, for just CMS and Zing
on
RAMDirectory:
The pause times for CMS are worse than they were at 50 QPS: already at
the 95% percentile the response times are too slow (4479
milli-seconds).
Zing works!
It's clear from these tests that Zing really has a very low-pause
garbage collector, even at high loads, while managing a 140 GB max
heap with 78 GB Lucene index loaded into
RAMDirectory.
Furthermore, applications can expect a substantial increase in max
throughput (around 2X faster in this case) and need not fear
using
RAMDirectory even for large indices, if they can
run under Zing.
Note that Azul has just
made
the Zing JVM freely available to open-source developers, so now we all
can run our own experiments, add Zing into the JVM rotation for our builds, etc.
Next I will test the
new
DirectPostingsFormat
which holds all postings in RAM in simple arrays (no compression) for
fast search performance. It requires even more RAM than this test,
but gives even faster search performance!
Thank you to Azul for providing a beefy server machine and the
Zing JVM to run these tests!