Monday, November 19, 2012

Lucene with Zing, Part 2

When I last tested Lucene with the Zing JVM the results were impressive: Zing's fully concurrent C4 garbage collector had very low pause times with the full English Wikipedia index (78 GB) loaded into RAMDirectory, which is not an easy feat since we know RAMDirectory is stressful for the garbage collector.

I had used Lucene 4.0.0 alpha for that first test, so I decided to re-test using Lucene's 4.0.0 GA release and, surprisingly, the results changed! MMapDirectory's max throughput was now better than RAMDirectory's (versus being much lower before), and the concurrent mark/sweep collector (-XX:+UseConcMarkSweepGC) was no longer hitting long GC pauses.

This was very interesting! What change could improve MMapDirectory's performance, and lower the pressure on concurrent mark/sweep's GC to the point where pause times were so much lower in GA compared to alpha?

Fortunately, Zing includes a nice tool, gcLogAnalyser, to visualize all sorts of data extracted from its GC logs. Like Zing's JVM, this tool is free for open-source development. To enable GC logging you should pass -verbose:gc and -XX:+PrintGCDetails to the Zing JVM, and it's also a good idea to redirect all GC output to a separate file using -Xloggc:<filename> so that any other output isn't mixed in.
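
For example, a full invocation might look something like this (the log file path and main class name here are placeholders, not my actual benchmark command):

  java -verbose:gc -XX:+PrintGCDetails -Xloggc:/tmp/zing-gc.log MySearchApp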

Using gcLogAnalyser it was clear that, for my test, Lucene 4.0.0 alpha had a much higher allocation rate compared to 4.0.0 GA, and from that hint I isolated the difference to this sneaky little performance bug (LUCENE-4289). That issue, present in 4.0.0 alpha and now fixed in 4.0.0 GA, didn't break anything but was causing FastVectorHighlighter to do far too many term lookups internally, consuming too much CPU, slowing down the max throughput and generating extra garbage collection pressure in the process.

It was great that Zing handled the higher allocation rate just fine while concurrent mark/sweep hit long pauses, and I wanted to see whether the same long-pause behavior would reappear if I changed the test.

So I constructed a new test, using 4.0.0 GA, that would bring back some of the garbage collector stress by using term queries matching fewer hits (~170K hits, versus up to ~3M hits before). This would mean more time would be spent highlighting, which is somewhat GC intensive, and less time retrieving hits, which generates very little garbage. I also increased the near-real-time indexing rate from 100 KB/sec to 2 MB/sec, and ran on the same full English Wikipedia index, but with deleted documents included (the net index was ~70% larger in bytes, but had the same 33.3 million live documents). The net index size was 132 GB and max heap was 240 GB (the computer has 512 GB RAM, plenty!). Each test (one data point in the graphs below) ran for one hour, recording the response time of all queries. I discarded the first five minutes' worth of queries to allow for warmup.

I set concurrent mark/sweep's new generation to 1 GB (-XX:NewSize=1g) because if I didn't, it would sometimes, strangely, pick a tiny (~18 MB) new generation size, which would kill its performance (~10X slower throughput).
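
Roughly, a CMS run with these settings looks like this (only the GC-related flags are shown; the heap size is the 240 GB from above, and the log path and main class are placeholders):

  java -Xmx240g -XX:+UseConcMarkSweepGC -XX:NewSize=1g \
       -verbose:gc -XX:+PrintGCDetails -Xloggc:/tmp/cms-gc.log MySearchApp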

Again I tried the experimental G1 collector and it had awful (~130 second) GC hangs, so I left it out.

The first surprising result was the max (saturated) throughput was nearly 2X faster with Zing than with concurrent mark/sweep using RAMDirectory; previously the two had nearly the same maximum throughput:

The graph plots the response time of the slowest query against the actual QPS achieved (total number of queries actually answered in the hour). It's also interesting that MMapDirectory gets higher max throughput than RAMDirectory with concurrent mark/sweep; this is likely because concurrent mark/sweep has higher overhead with a large heap. It could also be that, in my previous test, concurrent mark/sweep was picking a too-small new generation size (I had left it at its defaults before).

Here's the same graph, but discarding the over-capacity data points so we can see worst case query latencies below the max throughput:

Again concurrent mark/sweep has non-trivial pauses with both MMapDirectory and RAMDirectory, even at low loads, while Zing remains remarkably flat. What's going on?

Concurrent mark/sweep new generation bottleneck

I suspect in this test the new generation collection is a bottleneck, since it's still a stop-the-world collection, and this explains concurrent mark/sweep's slower max throughput when compared to Zing on RAMDirectory. From the logs, I see the new gen collection running 3-4 times per second and typically taking ~125 msec but sometimes up to 400-500 msec.

Unfortunately, if your application has a high allocation rate, you're stuck between a rock and a hard place: if you tune the new generation size smaller, throughput drops (this test already has much lower throughput than Zing's), but if you tune it larger you get longer stop-the-world pauses. Zing's C4 collector doesn't have this problem because even its new generation collector is fully concurrent, so it gains the efficiency of a larger new generation without suffering longer pause times.

Furthermore, I think Lucene's segment merges create an allocation pattern that exacerbates the new generation bottleneck: suddenly a large number of long-lived objects are created, causing high-latency new generation collections because these objects are not only copied out of the eden space but must also bounce back and forth in the survivor spaces until they are promoted. This "sudden" arrival of a large number and net size of long-lived objects can cause substantial pauses due to new generation collection.
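
If you want to watch this happen in your own application, HotSpot's -XX:+PrintTenuringDistribution flag prints, on each new generation collection, how many bytes are surviving at each age in the survivor spaces; for example (log path and main class are placeholders):

  java -XX:+UseConcMarkSweepGC -XX:NewSize=1g \
       -XX:+PrintTenuringDistribution -Xloggc:/tmp/gc.log MySearchApp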

Concurrent mark/sweep old generation fragmentation

The other, more serious failure mode of concurrent mark/sweep is fragmentation of the old generation. When this strikes, you'll see a "promotion failed" followed by a full (compacting) GC in the GC log, which can easily take tens of seconds on a large heap. I wasn't able to provoke this in my limited testing (maybe I didn't run the tests for long enough), but for a long-running application it's inevitable that eventually you'll hit promotion failed. It's a very real issue, and it comes up frequently on the Solr users list; here's a recent example.
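
A quick way to check whether a long-running instance has already hit this failure is to search its GC log for the tell-tale message (the log path is a placeholder):

  grep -c "promotion failed" /tmp/cms-gc.log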

Azul has a nifty free tool, Fragger, that easily provokes fragmentation and full collection, while only using a small (10% by default) part of the heap and allocating at a light (20 MB/sec default) rate. It "wraps" around any other java command-line, and accelerates testing since it fragments the old gen faster than your application normally would. I ran a quick test, adding Fragger to the search test, and sure enough concurrent mark/sweep hit horrible pauses while Zing was unaffected.

In summary, I remain impressed with Zing, and I wish its C4 collector were the default for Java! Then we could all stop worrying about GC and freely use Java with very large heaps, fundamentally changing how we build software in this age of very cheap RAM. Zing works fine at its defaults, while concurrent mark/sweep requires substantial tuning, and even then it will eventually hit very high pause times due to the new generation bottleneck and old generation fragmentation.

By the way, a very useful JVM option is -XX:+PrintGCApplicationStoppedTime. This prints the net time of every stop-the-world event, like this:
  Total time for which application threads were stopped: 0.4178300 seconds
You can use this to see which stop-the-world phase of your garbage collector is causing your application's hangs.
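
For example, to turn it on and then pull the worst pauses out of the log afterwards (the log path and main class are placeholders; the grep pattern matches the line shown above):

  java -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/gc.log MySearchApp
  grep "Total time for which application threads were stopped" /tmp/gc.log \
      | awk '{print $(NF-1)}' | sort -rn | head

This prints the ten longest stop-the-world times, in seconds.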

10 comments:

  1. Interesting post! Btw, you can monitor GC stoppage times and region promotions with "jstat -gcutil pid 5s" on Oracle JDK. This should be a bit more convenient than -XX:+PrintGCApplicationStoppedTime

    Replies
    1. -XX:+PrintGCApplicationStoppedTime reports a number that is different from what the GC logs (and jstat I believe) report for GC operation lengths. The difference is "subtle" but can end up being huge:
      Regular HotSpot GC log entries report the amount of time that GC spent doing whatever work is being reported. During pauses, this time is measured from the point in time when all threads are stopped at a safepoint, to the time that threads were allowed to run again.
      -XX:+PrintGCApplicationStoppedTime reports the length of time from the point where the first thread was asked to reach a safepoint, to the time that all threads had been allowed to run again.
      The difference is the amount of time it takes threads to reach a safepoint. GC work does not start until they do, but threads that have already reached the safepoint are already paused. This time-to-safepoint is usually very short, but can sometimes be amazingly long (as in an extra 1/2 second in cases I've seen in the wild, and much longer in lab tests).
      Regular HotSpot logs seem to not consider this time-to-safepoint gap as "GC pause time", because GC has not actually started... I guess one could argue that they are pauses that do not include any GC work.
      Anyway, I ALWAYS ask for -XX:+PrintGCApplicationStoppedTime, it's a much more reliable indicator of stoppage time.
      Or, if you don't want to simply believe what your log files are telling you, you can use something that measures things instead. Like the actual application response time. It won't lie. jHiccup was built for exactly this purpose...

  2. Thanks mindas, jstat -gcutil is very useful too!

  3. Why are you not setting the GC to use parallel GC for the new generation?

    It would also be interesting to use -XX:+UseParallelOldGC to see how that works.

    Finally, are you using large page memory on these tests?

    If your hardware is new enough, you can use 1GB large pages (the trick is you have to set the PermGen to at least 1GB, because it has to be at least one memory page). Large page memory helps GC overhead quite a bit.

    Using large page memory and ParallelGC with the CMS collector, and large page memory with the throughput collector (UseParallelOldGC) would be an interesting comparison to Zing that seems to be missing here.

  4. Hi Andrig,

    I'm pretty sure CMS defaults to parallel GC for the new generation.

    In the first set of tests I did try -XX:+UseParallelOldGC and it had horribly long pauses.

    I turned off transparent huge pages (I think that's the same as large page memory?) for these tests: with it on I was seeing longish (~ a few seconds) pauses with both Zing and CMS ... I suspect it was due to huge page compaction/defrag, but I'm not certain.

  5. I don't believe that CMS uses parallel GC for the new generation by default, unless the defaults have changed.

    In terms of transparent huge pages, in my testing, it doesn't work that well compared to configuring them statically. So, turning transparent huge pages off was probably the right thing. However, it is not the same as defining them statically. With transparent huge pages, there is a daemon process that has to scan memory and dynamically combine 4k pages into 2MB (for Intel and AMD systems the large page size is usually 2MB).

    If you define them statically, then you don't have this daemon process trying to do it on the fly while your workload is running. It's much more efficient. For the latest Intel and AMD hardware you can actually go to 1GB pages. With static large page memory (HugeTLB in Linux parlance), I think you might see substantial improvements.

  6. I don't know when Oracle's Java defaults changed, but when I run "java -XX:+PrintCommandLineFlags" (which shows you what "defaults" were picked on startup) with -XX:+UseConcMarkSweepGC, the output includes "-XX:+UseParNewGC". Maybe it's a dynamic decision based on core count / amount of RAM, etc.?

    Ahhhh, I didn't know about static large pages, to avoid the compaction/defrag cost! That's awesome... I need to read up on this. It sounds like it's not "transparent", ie, you must tell the JVM to use large pages (-XX:+UseLargePages). http://www.oracle.com/technetwork/java/javase/tech/largememory-jsp-137182.html has interesting details...

  7. Yes, it's not transparent. I have a blog post that describes how you set it up.

    It's here:

    http://andrigoss.blogspot.com/2008/02/jvm-performance-tuning.html

    This is really old, but the part about configuring the OS for large page memory is still correct. One little tidbit that I would add is that you don't have to do the calculation I show for /etc/security/limits.conf; you can just use the word unlimited, which makes the setup a little less complicated.
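
    For anyone who just wants the short version, the OS side boils down to roughly this (sizes are examples only; you need enough huge pages to cover your Java heap), plus -XX:+UseLargePages on the java command line:

      # reserve N 2MB huge pages (example: 20480 pages = ~40GB)
      sysctl -w vm.nr_hugepages=20480
      # in /etc/security/limits.conf, let the JVM user lock that memory:
      #   youruser soft memlock unlimited
      #   youruser hard memlock unlimited
      # (on some kernels you may also need to raise kernel.shmmax)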

  8. What version of the JDK were you using to get the horrible G1 results (130s hangs) in this and the previous test? I'm curious as I've been getting fairly good results on large-ish heaps (20-60G) with JDK 7 u4 and later for heavy-garbage-producing applications... The folks over at hotspot-gc-use@openjdk.java.net have also been very helpful/responsive when attempting to work out some kinks with it.

  9. Hi Aaron,

    It was 1.6.0_32 ... and I was really surprised by the G1 results! But that's good news that you are getting good results with it. I wonder if RAMDir is somehow a particularly bad / stressful test for it ...
