Sure, Lucene's wonderful randomized unit tests will catch an accidental bug, API breakage or perhaps a subtle corner-case issue during development. But nothing otherwise catches all-too-easy unexpected performance regressions, nor helps us measure performance gains when we optimize.
As a recent example, it looks like upgrading from JDK 12 to JDK 15 might have hurt Lucene's Month faceting queries/sec by ~5% (look for annotation DG in that chart). However, that was not the only change in that period, benchmarks failed to run for a few nights, and other tasks don't seem to show such a drop, so it's possible (likely?) there is another root cause. Such is the challenge of benchmarking! WTFs suddenly sprout up all the time.
Time flies when you are having fun: it has been almost five years since I last upgraded the custom hardware that runs Lucene's nightly benchmarks, nearly an eternity in computer-years! Thanks to the fast-paced technology market, computers keep getting relentlessly bigger, smaller, faster and cheaper.
So, finally, as of a couple months ago, November 6, 2020, I have switched our nightly benchmarks to a new custom-built workstation, creatively named beast3, with these parts:
- Single socket AMD Ryzen Threadripper "desktop class" 3990X (64 cores, 128 with hyperthreading), clocked/volted at defaults
- 256 GB quad channel Multi-Bit ECC DDR4 RAM, to reduce the chance of errant, confusing bit flips wasting precious developer time (plus Linus agrees!)
- Intel Optane SSD 905P Series, 960GB
- RAID 1 array (mirror) of NVMe Samsung 970 PRO 1 TB SSDs
- A spinning-magnets 16 TB Seagate IronWolf Pro
- Arch Linux, kernel 5.9.8-arch1-1
- OpenJDK 15.0.1+9-18
Watching top while running gradle test configured to use 64 JVMs is truly astounding. At times my whole terminal window is filled with only java! But, this also reveals the overall poor concurrency of Lucene's gradle/test-framework compiling and executing our numerous unit tests on highly concurrent hardware. Compilation of all main and test sources takes minutes and looks nearly single-threaded, with a single java process taking ~100% CPU. Most of the time my terminal is NOT full of java processes, and overall load is well below what the hardware could achieve. Patches welcome!
The gains across our various benchmarks are impressive:
- Indexing: ~42% faster for medium sized (~4 KB) docs, ~32% faster for small (~1 KB) docs
- Primary Key Lookup: ~49% faster
- TermQuery: ~48% faster
- BooleanQuery conjunctions of two high frequency terms: ~38% faster
- Month faceting: ~36% gain, followed by unexplained ~32% drop! (Edit: OK, it looks like it might be due to Lucene's default Codec no longer compressing BinaryDocValues by default -- we can fix that!)
- FuzzyQuery, edit distance 1: ~35% faster
- Geo-spatial filtering by Russia Polygon, LatLonPoint: ~31% faster
- LatLonPoint geo-spatial indexing: ~48% faster
- 10K grouping with TermQuery: ~39% faster
- Time to run all Lucene unit tests: ~43% faster
- Time to CheckIndex: ~22% faster
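For readers who want to sanity-check percentages like these, here is a minimal sketch of the arithmetic. It is not taken from Lucene's benchmark tooling: the class name `PercentGain` and its methods are hypothetical, the inputs in `main` are made-up numbers rather than measurements, and it assumes that throughput metrics (such as queries/sec) are compared as relative gain while elapsed-time metrics are compared as relative reduction. The post does not say which definition each bullet uses.

```java
// Hypothetical helper for illustration only; not part of Lucene or its benchmarks.
final class PercentGain {

  /** Relative gain, in percent, for a higher-is-better metric such as queries/sec. */
  static double throughputGainPct(double before, double after) {
    return 100.0 * (after - before) / before;
  }

  /** Relative reduction, in percent, for a lower-is-better metric such as elapsed seconds. */
  static double timeReductionPct(double beforeSec, double afterSec) {
    return 100.0 * (beforeSec - afterSec) / beforeSec;
  }

  /**
   * Net change, in percent, after applying several successive relative changes.
   * Assumes each change is expressed relative to the value just before it.
   */
  static double netChangePct(double... pctChanges) {
    double factor = 1.0;
    for (double pct : pctChanges) {
      factor *= 1.0 + pct / 100.0;
    }
    return 100.0 * (factor - 1.0);
  }

  public static void main(String[] args) {
    // Made-up inputs, not numbers from the nightly benchmarks.
    System.out.printf("throughput gain: %.1f%%%n", throughputGainPct(200.0, 284.0));
    System.out.printf("time reduction: %.1f%%%n", timeReductionPct(600.0, 342.0));
    // A ~36% gain followed by a ~32% drop, as in the Month faceting bullet.
    System.out.printf("net change: %.1f%%%n", netChangePct(36.0, -32.0));
  }
}
```

One thing the last call makes visible: if the ~32% drop is measured relative to the post-upgrade level, a ~36% gain followed by a ~32% drop does not roughly cancel out. It multiplies out to 1.36 × 0.68 ≈ 0.92, so the result would land about 7.5% below the starting point.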
I am happy to see the sizable jump in Lucene's indexing throughput, despite not yet increasing the number of indexing threads (still 36): it shows that Lucene's indexing implementation is indeed quite concurrent, allowing the faster cores to index more efficiently. However, smaller ~1 KB documents saw smaller gains than larger ~4 KB documents, likely due to some sort of locking contention in IndexWriter that is relatively more costly with smaller documents. Patches welcome!
The only serious wrinkle with upgrading to this new box is that, rarely, a java process will simply hang forever, until I notice, jstack it, and kill -9 it. I have opened this issue to try to get to the bottom of it. It may be yet another classloader deadlock bug.
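As a rough illustration of what a deadlock check looks like from inside a JVM, here is a minimal sketch using the standard java.lang.management API. The class name `DeadlockCheck` is hypothetical, and this is not how the hang above was diagnosed (that was done with jstack against the stuck process). It also only inspects the JVM it runs in, and it reports cycles over object monitors and ownable synchronizers, so a deadlock during class loading or class initialization may well not show up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class DeadlockCheck {

  /** Returns one description per deadlocked thread in this JVM, or an empty array if none are detected. */
  static String[] describeDeadlocks() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    // Returns null when no cycle of threads waiting on monitors/synchronizers is found.
    long[] ids = threads.findDeadlockedThreads();
    if (ids == null) {
      return new String[0];
    }
    List<String> out = new ArrayList<>();
    // Ask for locked monitors and locked synchronizers so the report shows who holds what.
    for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
      if (info != null) { // a thread may have exited between the two calls
        out.add(info.toString());
      }
    }
    return out.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] deadlocks = describeDeadlocks();
    if (deadlocks.length == 0) {
      System.out.println("No monitor/synchronizer deadlock detected in this JVM.");
    } else {
      for (String d : deadlocks) {
        System.out.println(d);
      }
    }
  }
}
```

An empty result does not rule out a hang: threads can still be stuck on something this check does not model, which is why a full thread dump of the hung process remains the more useful artifact.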
Another small challenge is that this is my first custom liquid cooling loop, and I am surprised how quickly (relatively speaking) the coolant "evaporates" despite being a closed loop with no obvious leaks. I just have to remember to add more coolant periodically, or else the CPU might start thermal throttling and make everything go slowly!