Changing Bits: Apache Lucene performance on 128-core AMD Ryzen Threadripper 3990X

Almost a decade ago, I started running Lucene's nightly benchmarks, and I have been trying, with mixed success, to keep them running every night through the numerous amazing changes relentlessly developed by the passionate Lucene community. The benchmarks run on the tip of Lucene's mainline branch each night, which is understandably a volatile, high-velocity code base.

Sure, Lucene's wonderful randomized unit tests will catch an accidental bug, an API breakage or perhaps a subtle corner-case issue during development. But nothing else catches the all-too-easy unexpected performance regressions, or helps us measure performance gains when we optimize.

As a recent example, it looks like upgrading from JDK 12 to JDK 15 might have hurt Lucene's Month faceting queries/sec by ~5% (look for annotation DG in that chart). However, that was not the only change in that period, benchmarks failed to run for a few nights, and other tasks don't seem to show such a drop, so it's possible (likely?) there is another root cause. Such is the challenge of benchmarking! WTFs suddenly sprout up all the time.
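To make the "did it really drop?" question concrete, here is a minimal sketch of how one might flag a drop like this in a noisy nightly series. It is not how the nightly benchmark charts work; the function names, the 3% threshold, the two-standard-deviation noise rule and the queries/sec values are all assumptions invented for illustration.

```python
from statistics import mean, stdev


def relative_change(before, after):
    """Relative change of mean(after) versus mean(before); negative means slower."""
    base = mean(before)
    return (mean(after) - base) / base


def looks_like_regression(before, after, threshold=0.03):
    """True when the mean dropped by more than `threshold` (as a fraction)
    and the drop is larger than twice the night-to-night noise in `before`.
    Both rules are arbitrary choices for this sketch."""
    base = mean(before)
    drop = base - mean(after)
    noise = 2 * stdev(before)
    return drop / base > threshold and drop > noise


# Hypothetical queries/sec readings, NOT real benchmark data.
before = [100.0, 101.0, 99.5, 100.5, 100.0]
after = [95.0, 95.5, 94.5, 95.2, 94.8]

if __name__ == "__main__":
    print(f"change: {relative_change(before, after):+.1%}")
    print("regression?", looks_like_regression(before, after))
```

A check like this only says that something changed between the two windows. It cannot say which of several changes in that window (a JDK upgrade, a code change, missed runs) is the cause, which is exactly the difficulty described above.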

Time flies when you are having fun: it has been almost five years since I last upgraded the custom hardware that runs Lucene's nightly benchmarks, nearly an eternity in computer-years! Thanks to the fast-paced technology market, computers keep getting relentlessly bigger, smaller, faster and cheaper.

So, finally, a couple of months ago, on November 6, 2020, I switched our nightly benchmarks to a new custom-built workstation, creatively named beast3, with these parts:

Single socket AMD Ryzen Threadripper "desktop class" 3990X (64 cores, 128 with hyperthreading), clocked/volted at defaults
256 GB quad-channel multi-bit ECC DDR4 RAM, to reduce the chance of errant, confusing bit flips wasting precious developer time (plus Linus agrees!)
Intel Optane SSD 905P Series, 960GB
RAID 1 array (mirror) of NVMe Samsung 970 PRO 1 TB SSDs
A spinning-magnets 16 TB Seagate IronWolf Pro
Arch Linux, kernel 5.9.8-arch1-1
OpenJDK 15.0.1+9-18

All Lucene benchmarks use the Optane SSD to store their Lucene indices, though it is likely unimportant since the 256 GB of RAM ensures the indices are nearly entirely hot. All source documents are pulled from the RAID 1 SSD mirror to ensure reading the source documents is very fast and will not conflict with writing the Lucene indices.

beast2 was an impressive workstation five years ago, with dual-socket Intel Xeon E5-2699 v3 "server class" CPUs, but this new workstation, now using a nominally lower "desktop class" CPU in a single socket, is even faster.

Watching top while running gradle test configured to use 64 JVMs is truly astounding. At times my whole terminal window is filled with only java! But, this also reveals the overall poor concurrency of Lucene's gradle/test-framework compiling and executing our numerous unit tests on highly concurrent hardware. Compilation of all main and test sources takes minutes and looks nearly single-threaded, with a single java process taking ~100% CPU. Most of the time my terminal is NOT full of java processes, and overall load is well below what the hardware could achieve. Patches welcome!
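Amdahl's law gives a rough feel for why a nearly single-threaded compile step hurts so much on a 64-core box. The sketch below is generic arithmetic, not a measurement of Lucene's build; the serial fractions are hypothetical values chosen only to show the shape of the curve.

```python
def amdahl_speedup(serial_fraction, workers):
    """Best-case speedup when `serial_fraction` of the work cannot be
    parallelized and the rest is spread evenly over `workers`."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


if __name__ == "__main__":
    # Hypothetical serial fractions; the real split for the build is not measured here.
    for serial in (0.05, 0.25, 0.50):
        speedup = amdahl_speedup(serial, workers=64)
        print(f"serial fraction {serial:.0%}: at most {speedup:.1f}x on 64 workers")
```

Even a modest serial portion caps the achievable speedup far below 64x, so shrinking the single-threaded phases matters more than adding JVMs.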

The gains across our various benchmarks are impressive:

Indexing: ~42% faster for medium sized (~4 KB) docs, ~32% faster for small (~1 KB) docs
Primary Key Lookup: ~49% faster
TermQuery: ~48% faster
BooleanQuery conjunctions of two high frequency terms: ~38% faster
Month faceting: ~36% gain, followed by unexplained ~32% drop! (Edit: OK, it looks like it might be due to Lucene's default Codec no longer compressing BinaryDocValues by default -- we can fix that!)
FuzzyQuery, edit distance 1: ~35% faster
Geo-spatial filtering by Russia Polygon, LatLonPoint: ~31% faster
LatLonPoint geo-spatial indexing: ~48% faster
10K grouping with TermQuery: ~39% faster
Time to run all Lucene unit tests: ~43% faster
Time to CheckIndex: ~22% faster

Most of these tasks are by design effectively testing single-core performance, showing each core of the new CPU is also substantially faster than one core of the older Xeon. The exceptions are Indexing, Primary Key Lookup and Time to run all Lucene unit tests, which do effectively use multiple cores.

I am happy to see the sizable jump in Lucene's indexing throughput, despite not yet increasing the number of indexing threads (still 36): it shows that Lucene's indexing implementation is indeed quite concurrent, allowing the faster cores to index more efficiently. However, smaller ~1 KB documents saw smaller gains than larger ~4 KB documents, likely due to some sort of lock contention in IndexWriter that is relatively more costly with smaller documents. Patches welcome!
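Here is a toy model of that hypothesis, purely as a thought experiment. It assumes every document spends a fixed amount of time inside one shared lock plus parallel work proportional to its size. The function name and the cost numbers are invented; nothing here is measured from IndexWriter.

```python
def indexing_throughput(threads, serial_us, parallel_us_per_kb, doc_kb):
    """Docs/sec under a toy model: each doc needs `serial_us` microseconds
    inside one shared lock plus size-proportional work that runs in parallel."""
    per_doc_us = serial_us + parallel_us_per_kb * doc_kb
    unconstrained = threads * 1_000_000 / per_doc_us  # if threads never waited
    lock_cap = 1_000_000 / serial_us                  # the lock admits one doc at a time
    return min(unconstrained, lock_cap)


if __name__ == "__main__":
    # Hypothetical costs: 10 us under the lock, 200 us of parallel work per KB.
    for doc_kb in (1, 4):
        rate = indexing_throughput(threads=36, serial_us=10,
                                   parallel_us_per_kb=200, doc_kb=doc_kb)
        print(f"{doc_kb} KB docs: {rate:,.0f} docs/sec")
```

With these made-up numbers the 1 KB documents are limited by the lock while the 4 KB documents are not, so faster cores would help the larger documents more. Whether IndexWriter actually behaves this way is exactly what a profiler would need to confirm.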

The only serious wrinkle with upgrading to this new box is that, on rare occasions, a java process will simply hang forever, until I notice, jstack it and kill -9 it. I have opened this issue to try to get to the bottom of it. It may be yet another classloader deadlock bug.
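Until the root cause is found, the "notice, then kill" step could in principle be automated with a watchdog around each child process. This is a minimal sketch, not what the nightly benchmark scripts actually do; the `run_with_watchdog` name and the timeout value are assumptions.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_sec):
    """Run `cmd`; if it is still alive after `timeout_sec`, kill it.

    Returns (hung, returncode).
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_sec)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        # A real harness would probably capture `jstack <pid>` output here,
        # before killing, so the hang can still be diagnosed afterwards.
        proc.kill()  # sends SIGKILL on POSIX, like kill -9
        proc.wait()  # reap the child so it does not linger as a zombie
        return True, proc.returncode


if __name__ == "__main__":
    # Stand-in for a hung java process: a child that sleeps far past the timeout.
    hung, code = run_with_watchdog(
        [sys.executable, "-c", "import time; time.sleep(30)"], timeout_sec=0.5)
    print("hung:", hung, "return code:", code)
```

The timeout has to be generous enough that a slow but healthy run is not mistaken for a hang.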

Another small challenge is that this is my first custom liquid cooling loop, and I am surprised how quickly (relatively speaking) the coolant "evaporates" despite being a closed loop with no obvious leaks. I just need to remember to add more coolant periodically, or else the CPU might start thermal throttling and make everything go slowly!

Monday, January 4, 2021
