Monday, November 19, 2012

Lucene with Zing, Part 2

When I last tested Lucene with the Zing JVM the results were impressive: Zing's fully concurrent C4 garbage collector had very low pause times with the full English Wikipedia index (78 GB) loaded into RAMDirectory, which is not an easy feat since we know RAMDirectory is stressful for the garbage collector.

I had used Lucene 4.0.0 alpha for that first test, so I decided to re-test using Lucene's 4.0.0 GA release and, surprisingly, the results changed! MMapDirectory's max throughput was now better than RAMDirectory's (versus being much lower before), and the concurrent mark/sweep collector (-XX:+UseConcMarkSweepGC) was no longer hitting long GC pauses.

This was very interesting! What change could improve MMapDirectory's performance, and lower the pressure on concurrent mark/sweep's GC to the point where pause times were so much lower in GA compared to alpha?

Fortunately, Zing includes a nice tool, gcLogAnalyser, to visualize all sorts of data extracted from its GC logs. Like Zing's JVM, the tool is free for open-source development. To enable GC logging you should pass -verbose:gc and -XX:+PrintGCDetails to the Zing JVM, and it's also a good idea to redirect all GC output to a separate file using -Xloggc:<filename> so that no other output gets mixed in.
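For example, the logging options together look something like this (gc.log is just a placeholder file name; the rest of the command line is whatever your application already uses):

  java -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log <your usual flags, classpath and main class>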

Using gcLogAnalyser it was clear that, for my test, Lucene 4.0.0 alpha had a much higher allocation rate compared to 4.0.0 GA, and from that hint I isolated the difference to this sneaky little performance bug (LUCENE-4289). That issue, present in 4.0.0 alpha and now fixed in 4.0.0 GA, didn't break anything but was causing FastVectorHighlighter to do far too many term lookups internally, consuming too much CPU, slowing down the max throughput and generating extra garbage collection pressure in the process.

It was great to see that Zing handled the higher allocation rate just fine while concurrent mark/sweep hit long pauses, and I wanted to see whether the same long-pause behavior would show up again if I changed the test.

So I constructed a new test, using 4.0.0 GA, that would bring back some of the garbage collector stress by using term queries matching fewer hits (~170K hits, versus up to ~3M hits before). This means more time is spent highlighting, which is somewhat GC intensive, and less time is spent retrieving hits, which generates very little garbage. I also increased the near-real-time indexing rate from 100 KB/sec to 2 MB/sec, and ran on the same full English Wikipedia index, but with deleted documents included (the net index was ~70% larger in bytes, but had the same 33.3 million live documents). The net index size was 132 GB and max heap was 240 GB (the computer has 512 GB RAM, plenty!). Each test (one data point in the graphs below) ran for one hour, recording the response time of all queries; I discarded the first 5 minutes' worth of queries to allow for warmup.
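To make it concrete where the highlighting garbage comes from, here is a minimal sketch of roughly what each query in this test does: run a term query, then highlight the top hits with FastVectorHighlighter. This is not the actual benchmark code; the index path, query term, "body" field name, hit count and fragment size are placeholders, and it assumes the field was indexed with term vectors, positions and offsets (which FastVectorHighlighter requires):

  import java.io.File;

  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
  import org.apache.lucene.search.vectorhighlight.FieldQuery;
  import org.apache.lucene.store.MMapDirectory;

  public class HighlightSketch {
    public static void main(String[] args) throws Exception {
      // args[0] = index directory, args[1] = query term (placeholders)
      IndexReader reader = DirectoryReader.open(new MMapDirectory(new File(args[0])));
      IndexSearcher searcher = new IndexSearcher(reader);

      TermQuery query = new TermQuery(new Term("body", args[1]));
      TopDocs hits = searcher.search(query, 10);

      FastVectorHighlighter highlighter = new FastVectorHighlighter();
      FieldQuery fieldQuery = highlighter.getFieldQuery(query);
      for (ScoreDoc hit : hits.scoreDocs) {
        // Highlighting each hit allocates fragment/snippet objects; with
        // queries matching fewer hits, a larger fraction of each query's
        // time (and garbage) comes from this step:
        String fragment = highlighter.getBestFragment(fieldQuery, reader, hit.doc, "body", 100);
        System.out.println(hit.doc + ": " + fragment);
      }

      reader.close();
    }
  }

Retrieving the hits themselves (the TopDocs) allocates very little, which is why shifting the balance toward highlighting raises the GC pressure.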

I set concurrent mark/sweep's new generation to 1 GB (-XX:NewSize=1g) because if I didn't, it would strangely sometimes pick a tiny (~18 MB) new generation size which would kill its performance (~10X slower throughput).
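Putting the pieces together, the concurrent mark/sweep runs use a command line roughly like this (a reconstruction from the settings described above, not the exact invocation; the benchmark's own flags and main class are omitted):

  java -Xmx240g -XX:+UseConcMarkSweepGC -XX:NewSize=1g -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log <benchmark flags and main class>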

Again I tried the experimental G1 collector and it had awful (~130 second) GC hangs, so I left it out.

The first surprising result was that the max (saturated) throughput with RAMDirectory was nearly 2X higher with Zing than with concurrent mark/sweep; previously the two had nearly the same maximum throughput:

The graph plots the response time of the slowest query against the actual QPS achieved (total number of queries actually answered in the hour). It's also interesting that MMapDirectory gets higher max throughput than RAMDirectory with concurrent mark/sweep; this is likely because concurrent mark/sweep has higher overhead with a large heap. It could also be that in my previous test concurrent mark/sweep was picking too small a size for the new generation (I had left it at its defaults then).

Here's the same graph, but discarding the over-capacity data points so we can see worst case query latencies below the max throughput:

Again concurrent mark/sweep has non-trivial pauses with both MMapDirectory and RAMDirectory, even at low loads, while Zing remains remarkably flat. What's going on?

Concurrent mark/sweep new generation bottleneck

I suspect in this test the new generation collection is a bottleneck, since it's still a stop-the-world collection, and this explains concurrent mark/sweep's slower max throughput when compared to Zing on RAMDirectory. From the logs, I see the new gen collection running 3-4 times per second and typically taking ~125 msec but sometimes up to 400-500 msec.

Unfortunately, if your application has a high allocation rate, you're stuck between a rock and a hard place: if you tune the new generation size smaller, throughput drops (concurrent mark/sweep's throughput in this test is already much lower than Zing's), but if you tune it larger you get longer stop-the-world pauses. Zing's C4 collector doesn't have this problem because even its new generation collector is fully concurrent, so it gains the efficiency of a larger new generation without suffering longer pause times.

Furthermore, I think Lucene's segment merges create an allocation pattern that exacerbates the new generation bottleneck: suddenly a large number of long-lived objects is created, causing high-latency new generation collections because these objects are not only copied out of the eden space but also bounce back and forth between the survivor spaces until they are finally promoted. This sudden arrival of many sizable long-lived objects can cause substantial new generation pauses.
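If you want to watch this happening in your own application, one option (not something I used in this test, just a pointer) is HotSpot's -XX:+PrintTenuringDistribution flag, which logs how much data sits at each age in the survivor spaces after every new generation collection; objects allocated during a merge will typically show up as a burst of data that keeps aging rather than dying:

  java -verbose:gc -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -Xloggc:gc.log <your usual flags and main class>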

Concurrent mark/sweep old generation fragmentation

The other, more serious failure mode of concurrent mark/sweep is fragmentation of the old generation. When this strikes, you'll see a "promotion failed" followed by a full (compacting) GC in the GC log, which can easily take tens of seconds on a large heap. I wasn't able to provoke this in my limited testing (maybe I didn't run the tests for long enough), but for a long-running application it's inevitable that you'll eventually hit a promotion failure. It's a very real issue, and it comes up frequently on the Solr users' list; here's a recent example.

Azul has a nifty free tool, Fragger, that easily provokes fragmentation and full collection, while only using a small (10% by default) part of the heap and allocating at a light (20 MB/sec default) rate. It "wraps" around any other java command-line, and accelerates testing since it fragments the old gen faster than your application normally would. I ran a quick test, adding Fragger to the search test, and sure enough concurrent mark/sweep hit horrible pauses while Zing was unaffected.

In summary, I remain impressed with Zing, and I wish its C4 collector were the default for Java! Then we could all stop worrying about GC and freely use Java with very large heaps, fundamentally changing how we build software in this age of very cheap RAM. Zing works fine at its defaults, while concurrent mark/sweep requires substantial tuning, and even then will eventually hit very high pause times due to the new generation bottleneck and old generation fragmentation.

By the way, a very useful JVM option is -XX:+PrintGCApplicationStoppedTime. This prints the net time of every stop-the-world event, like this:
  Total time for which application threads were stopped: 0.4178300 seconds
You can use this to see which stop-the-world phase of your garbage collector is causing your application's hangs.
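It combines well with the GC logging flags above; for example (again, gc.log is just a placeholder, and the grep/sort pipeline is only one way to pull out the longest stops on a Unix shell):

  java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime -Xloggc:gc.log <your usual flags and main class>
  grep "Total time for which application threads were stopped" gc.log | sort -n -k 9 | tail   # adjust the field index if your log lines carry timestamp prefixes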