RAMDirectory, which is not an easy feat since we know RAMDirectory is stressful for the garbage collector. I had used Lucene 4.0.0 alpha for that first test, so I decided to re-test using Lucene's 4.0.0 GA release and, surprisingly, the results changed! MMapDirectory's max throughput was now better than RAMDirectory's (versus being much lower before), and the concurrent mark/sweep collector (-XX:+UseConcMarkSweepGC) was no longer hitting long GC pauses.
This was very interesting! What change could improve MMapDirectory's performance, and lower the pressure on concurrent mark/sweep's GC to the point where pause times were so much lower in GA compared to alpha?
Fortunately, Zing includes a nice tool, gcLogAnalyser, to visualize all sorts of data extracted from its GC logs. Like Zing's JVM, this tool is also free for open-source development. To enable GC logging you should pass -verbose:gc and -XX:+PrintGCDetails to the Zing JVM, and it's also a good idea to redirect all GC output to a separate file using -Xloggc so that any other output isn't mixed in.
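For example, assuming Zing accepts the usual HotSpot syntax for these options (the heap size, log file name, and main class below are just placeholders), the invocation might look like:

    java -Xmx240g -verbose:gc -XX:+PrintGCDetails -Xloggc:zing-gc.log MySearchTest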
Using gcLogAnalyser it was clear that, for my test, Lucene 4.0.0 alpha had a much higher allocation rate compared to 4.0.0 GA, and from that hint I isolated the difference to this sneaky little performance bug (LUCENE-4289). That issue, present in 4.0.0 alpha and now fixed in 4.0.0 GA, didn't break anything but was causing FastVectorHighlighter to do far too many term lookups internally, consuming too much CPU, slowing down the max throughput and generating extra garbage collection pressure in the process.
It was great that Zing handled the higher allocation rate just fine while concurrent mark/sweep hit long pauses, and I wanted to see whether this same long-pause behavior might happen again if I changed the test.
So I constructed a new test, using 4.0.0 GA, that would bring back some of the garbage collector stress by using term queries matching fewer hits (~170K hits, versus up to ~3M hits before). This means more time is spent highlighting, which is somewhat GC intensive, and less time is spent retrieving hits, which generates very little garbage. I also increased the near-real-time indexing rate from 100 KB/sec to 2 MB/sec, and ran on the same full English Wikipedia index, but with deleted documents included (the index was ~70% larger in bytes, but had the same 33.3 million live documents). The net index size was 132 GB and max heap was 240 GB (the computer has 512 GB RAM, plenty!). Each test (one data point in the graphs below) ran for one hour, recording the response time of all queries; I discarded the first 5 minutes' worth of queries to allow for warmup.
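To make the measurement methodology concrete, here is a minimal sketch of the query-latency loop against the Lucene 4.0 API; the index path, field name, and query term are placeholder assumptions, and the real test additionally runs many query threads plus concurrent near-real-time indexing:

    // Minimal latency-measurement sketch (Lucene 4.0 API). Placeholders:
    // /path/to/index, the "body" field, and the query term are assumptions.
    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;

    public class LatencyTest {
      public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(
            FSDirectory.open(new File("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        final long warmupMS = 5 * 60 * 1000;   // discard first 5 minutes
        final long testMS = 60 * 60 * 1000;    // run for one hour
        long start = System.currentTimeMillis();
        long worstNS = 0;
        long queryCount = 0;
        while (System.currentTimeMillis() - start < testMS) {
          long t0 = System.nanoTime();
          searcher.search(new TermQuery(new Term("body", "ref")), 10);
          long elapsedNS = System.nanoTime() - t0;
          if (System.currentTimeMillis() - start > warmupMS) {
            worstNS = Math.max(worstNS, elapsedNS);
            queryCount++;
          }
        }
        System.out.println(queryCount + " queries; worst latency "
            + (worstNS / 1000000.0) + " msec");
        reader.close();
      }
    }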
I set concurrent mark/sweep's new generation to 1 GB (-XX:NewSize=1g) because if I didn't, it would strangely sometimes pick a tiny (~18 MB) new generation size, which would kill its performance (~10X slower throughput).
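Concretely, a concurrent mark/sweep run for this test might be launched along these lines (a sketch: the main class is a placeholder, and the logging flags are the same ones described above):

    java -Xmx240g -XX:+UseConcMarkSweepGC -XX:NewSize=1g \
         -verbose:gc -XX:+PrintGCDetails -Xloggc:cms-gc.log MySearchTest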
Again I tried the experimental G1 collector, and again it hit awful (~130 second) GC hangs, so I left it out.
The first surprising result was that the max (saturated) throughput was nearly 2X faster with Zing than with concurrent mark/sweep using RAMDirectory; previously the two had nearly the same maximum throughput:
The graph plots the response time of the slowest query against the actual QPS achieved (total number of queries actually answered in the hour). It's also interesting that MMapDirectory gets higher max throughput than RAMDirectory with concurrent mark/sweep; this is likely because concurrent mark/sweep has higher overhead with a large heap. It could also be that, in my previous test, concurrent mark/sweep was picking a too-small new generation size (I had left it at its defaults before).
Here's the same graph, but discarding the over-capacity data points so we can see worst case query latencies below the max throughput:
Again concurrent mark/sweep has non-trivial pauses with both MMapDirectory and RAMDirectory, even at low loads, while Zing remains remarkably flat. What's going on?
Concurrent mark/sweep new generation bottleneck
I suspect in this test the new generation collection is a bottleneck, since it's still a stop-the-world collection, and this explains concurrent mark/sweep's slower max throughput compared to Zing on RAMDirectory. From the logs, I see the new gen collection running 3-4 times per second, typically taking ~125 msec but sometimes up to 400-500 msec.
Unfortunately, if your application has a high allocation rate, you're stuck between a rock and a hard place: if you tune the new generation size smaller, the throughput drops (this test already has much lower throughput than Zing), but if you tune it larger then you get longer stop-the-world pauses. Zing's C4 collector doesn't have this problem because even its new generation collector is fully concurrent, so it gains the efficiency of a larger new generation without suffering longer pause times.
Furthermore, I think Lucene's segment merges create an allocation pattern that exacerbates the new generation bottleneck: suddenly a large number of long-lived objects are created, causing high-latency new generation collections because these objects are not only copied out of the eden space but must also bounce back and forth in the survivor spaces until they are promoted. This sudden arrival of so many long-lived objects can cause substantial new generation pauses.
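To see this effect in isolation, here is a toy sketch (not Lucene code; the sizes and counts are arbitrary assumptions) that alternates steady short-lived garbage with bursts of retained objects, loosely mimicking a merge; running it with -verbose:gc shows new generation pauses spiking around the bursts:

    // Toy allocation pattern: steady short-lived garbage, punctuated by
    // bursts of long-lived objects that must survive several new-gen
    // collections before promotion. All sizes/counts are arbitrary.
    import java.util.ArrayList;
    import java.util.List;

    public class MergeBurst {
      public static void main(String[] args) throws Exception {
        List<byte[]> longLived = new ArrayList<byte[]>();
        long sink = 0;
        for (int iter = 0; iter < 1000; iter++) {
          // Steady state: short-lived garbage, cheap to collect from eden.
          for (int i = 0; i < 100000; i++) {
            byte[] shortLived = new byte[256];
            sink += shortLived.length;  // keep allocation from being optimized away
          }
          // "Merge": a sudden batch of objects we keep holding onto.
          for (int i = 0; i < 10000; i++) {
            longLived.add(new byte[4096]);
          }
          if (longLived.size() > 500000) {
            longLived.clear();  // release them, creating old-gen garbage too
          }
          Thread.sleep(100);
        }
        System.out.println("sink=" + sink);
      }
    }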
Concurrent mark/sweep old generation fragmentation
The other, more serious failure mode of concurrent mark/sweep is fragmentation of the old generation. When this strikes, you'll see a "promotion failed" event followed by a full (compacting) GC in the GC log, which can easily take tens of seconds on a large heap. I wasn't able to provoke this in my limited testing (maybe I didn't run the tests for long enough), but for a long-running application it's inevitable that you'll eventually hit promotion failure. It's a very real issue, and it comes up frequently on the Solr users list; here's a recent example.
Azul has a nifty free tool, Fragger, that easily provokes fragmentation and full collections, while using only a small part of the heap (10% by default) and allocating at a light rate (20 MB/sec by default). It "wraps" around any other Java command line, and it accelerates testing since it fragments the old gen faster than your application normally would. I ran a quick test, adding Fragger to the search test, and sure enough concurrent mark/sweep hit horrible pauses while Zing was unaffected.
In summary, I remain impressed with Zing, and I wish its C4 collector were the default for Java! Then we all would stop having any GC worries and could freely use Java with very large heaps, fundamentally changing how we build software in this age of very cheap RAM. Zing works fine at its defaults, while concurrent mark/sweep requires substantial tuning, and even then will eventually hit very high pause times due to the new generation bottleneck and old generation fragmentation.
By the way, a very useful JVM option is -XX:+PrintGCApplicationStoppedTime. This prints the net time of every stop-the-world event, like this:

    Total time for which application threads were stopped: 0.4178300 seconds

You can use this to see which stop-the-world phase of your garbage collector is causing your application's hangs.
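For instance, a small sketch like this (the log file name is a placeholder) can summarize those lines across a whole run:

    // Summarize -XX:+PrintGCApplicationStoppedTime output; assumes the log
    // line format shown above, and "gc.log" is a placeholder file name.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class StoppedTimeSummary {
      public static void main(String[] args) throws Exception {
        Pattern p = Pattern.compile(
            "Total time for which application threads were stopped: ([0-9.]+) seconds");
        double total = 0, worst = 0;
        int count = 0;
        BufferedReader in = new BufferedReader(new FileReader("gc.log"));
        String line;
        while ((line = in.readLine()) != null) {
          Matcher m = p.matcher(line);
          if (m.find()) {
            double sec = Double.parseDouble(m.group(1));
            total += sec;
            worst = Math.max(worst, sec);
            count++;
          }
        }
        in.close();
        System.out.printf("%d stop-the-world events; worst %.3f sec; total %.1f sec%n",
                          count, worst, total);
      }
    }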