Sunday, October 6, 2019

Concurrent query execution in Apache Lucene

Apache Lucene is a wonderfully concurrent pure Java search engine, easily able to saturate the available CPU or IO resources on your server, if you ask it to. The concurrency model for a "typical" Lucene application is one thread per query at search time, but did you know Lucene can also execute a single query concurrently using multiple threads to greatly reduce how long your slowest queries take?

Lucene's IndexSearcher class, responsible for executing incoming queries to find their top matching hits from your index, accepts an optional Executor (e.g. a thread pool) during construction. If you pass an Executor and your CPUs are idle enough (i.e. your server is well below its red-line QPS throughput capacity), Lucene will use multiple concurrent threads to find the top overall hits for each query.
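
Here is a minimal sketch of what that looks like in Java; the index path, field name, query and thread-pool size are placeholder choices for illustration, not recommendations:

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class ConcurrentSearchExample {
      public static void main(String[] args) throws Exception {
        // Thread pool the searcher will use to search thread slices concurrently;
        // size it to how many cores you are willing to spend on a single query.
        ExecutorService executor = Executors.newFixedThreadPool(8);

        Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          // Passing the Executor enables concurrent (multi-threaded) execution of each query:
          IndexSearcher searcher = new IndexSearcher(reader, executor);

          Query query = new TermQuery(new Term("body", "apples"));
          TopDocs topDocs = searcher.search(query, 10);
          System.out.println("total hits: " + topDocs.totalHits);
        } finally {
          executor.shutdown();
        }
      }
    }

You would normally create one shared thread pool up front and reuse it across all searches, rather than creating a pool per query.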

How does it do that? A Lucene index is segmented, which makes searching it an embarrassingly parallel problem: each query must visit all segments in the index, collecting their globally competitive hits. When the query is single-threaded, because you did not pass an Executor to IndexSearcher, that one query thread must visit all segments sequentially. If the index is large, and your queries are costly, those queries will naturally require high CPU cost and wall clock time to find the top hits. This will cause high long-pole (P90+) query latencies even when you are running the server well below its red-line QPS (throughput) capacity.

Instead, when you do pass an Executor to IndexSearcher, the segments in the index are first grouped up front into single thread work units called thread slices. By default, large segments belong to their own thread slice and up to 5 smaller segments with at most 250K total documents will be coalesced into a single thread slice, since they are presumably quick to search sequentially by a single thread. You can easily customize how segments are coalesced into thread slices by subclassing IndexSearcher and overriding its protected slices method. Each incoming query is then executed concurrently, as long as the server is idle enough to spend multiple CPU cores on one query, with one thread working on each thread slice for that query.
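
For example, here is a minimal sketch of such a subclass; it leans on the static slices(leaves, maxDocsPerSlice, maxSegmentsPerSlice) helper found in recent Lucene releases, and the specific limits (1M documents, 10 segments per slice) are made-up values just to show the shape:

    import java.util.List;
    import java.util.concurrent.Executor;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.IndexSearcher;

    // Example subclass that coalesces segments into larger thread slices than
    // the defaults (at most 5 segments / 250K documents per slice).
    public class CustomSliceIndexSearcher extends IndexSearcher {

      public CustomSliceIndexSearcher(IndexReader reader, Executor executor) {
        super(reader, executor);
      }

      @Override
      protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
        // Example policy: allow up to 10 segments and 1M documents per slice,
        // producing fewer, larger work units (and hence fewer threads) per query.
        return slices(leaves, 1_000_000, 10);
      }
    }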

This powerful feature was originally proposed almost 16 years ago by Jean-François Halleux, then committed by Doug Cutting himself (hello Doug!), and finally refactored into IndexSearcher almost 9 years ago. It has since undergone a bunch of iterative improvements, many unfolding now thanks to Atri Sharma, a recently added Lucene/Solr committer. Such is the distributed power of passionate open-source software development!

Concurrent query execution is a surprisingly little known sleeper feature in Lucene, since it is not yet exposed in Elasticsearch nor Solr, two popular distributed search applications that build on Lucene. Their concurrency model is instead concurrent search across index shards (usually on different servers) for a single query, but using single-threaded search within each shard.

This means many concurrent independent queries are required in order to saturate cluster wide CPU or IO resources. Until the cluster sees at least that minimum floor QPS, the full hardware resources cannot be utilized. For use cases that often see high query rates, this limitation is acceptable. But other common use-cases that have a large index and lower query rate would benefit substantially from concurrent query execution within a single cluster node if Elasticsearch or Solr were to use this feature.

The real-world effects of Moore's law have shifted: modern server-class computers are built with amazing and rapidly increasing concurrency, not just in their CPUs, where we now see 96 cores in the latest c5.24xlarge AWS EC2 instances, but also in their Graphics Processing Units (GPUs), memory buses and DIMMs, and solid-state disks (SSDs), which are in fact large concurrent RAID 0 arrays under the hood. The recent trend is for CPUs and GPUs to gain more concurrency (cores), and less so for each individual core to get much faster. Why not use all this increasing concurrency to make all queries faster, and saturate CPU/IO even at low query loads?

Tricky Tradeoffs

Unfortunately, even though searching a Lucene index is a naturally and embarrassingly parallel problem, using multiple threads for one query incurs inherent coordination overhead. To understand why, consider a simple analogy: imagine you need apples, so you send your kids to the local grocery store to buy them. If you have only one child, you send her; she walks around the entire produce section, picks the ten best apples, and brings them home.

But if you have five children and you send all of them to the store, will they come back five times faster, ignoring the "networking" time for them to get to and from the store? How do they efficiently split the work?

Perhaps your children are clever and they first split up all the apple sections in the store (there are many diverse apple choices these days!) into five roughly equal sections. Each child runs around her own apple section, picking the ten best apples she can find, and then they all meet up at the checkout counter and work closely together to choose the total ten best out of the fifty apples they now have. This is somewhat wasteful, since the children collected fifty apples overall just to choose the actual ten best in the end, but it should indeed be faster than one child picking the ten best overall.

This is effectively how Lucene implements concurrent search today: each searcher thread works alone to find its own top N best hits from one thread slice (the "map" phase), and then, once all query threads have finished and joined back to the main thread, the main thread uses a partial merge sort to find the total top N best hits from the hits collected for each thread slice (the "reduce" phase). Lucene's CollectorManager, Collector and LeafCollector abstractions all work together to implement this. This means more total work is done versus the single threaded case, since now M * N total hits were collected and then reduced to just the top N in the end, where M is the number of concurrent search threads and N is the requested number of top hits to retrieve.
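
As a rough illustration of those abstractions, here is a simplified CollectorManager that collects the top N hits for each thread slice and then merges them; this is only a sketch of the idea, since IndexSearcher.search(Query, int) already does the equivalent internally:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;

    import org.apache.lucene.search.CollectorManager;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.TopScoreDocCollector;

    // Simplified CollectorManager showing the "map" (one collector per thread slice)
    // and "reduce" (partial merge sort of each slice's top hits) phases.
    public class TopHitsCollectorManager implements CollectorManager<TopScoreDocCollector, TopDocs> {

      private final int numHits;

      public TopHitsCollectorManager(int numHits) {
        this.numHits = numHits;
      }

      @Override
      public TopScoreDocCollector newCollector() {
        // Each searcher thread gets its own collector, so threads collect their
        // slice's top N hits without any locking ("map" phase).
        return TopScoreDocCollector.create(numHits, Integer.MAX_VALUE);
      }

      @Override
      public TopDocs reduce(Collection<TopScoreDocCollector> collectors) throws IOException {
        // Merge the per-slice top hits down to the overall top N ("reduce" phase).
        List<TopDocs> perSlice = new ArrayList<>();
        for (TopScoreDocCollector collector : collectors) {
          perSlice.add(collector.topDocs());
        }
        return TopDocs.merge(numHits, perSlice.toArray(new TopDocs[0]));
      }
    }

Running searcher.search(query, new TopHitsCollectorManager(10)) then invokes newCollector() once per thread slice and reduce(...) once at the end, on the calling thread.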

That added coordination cost will necessarily hurt the red-line QPS capacity (throughput) of the search node, when running each query concurrently, since Lucene is spending more total CPU cycles finding the top hits. Yet at the same time, it can greatly improve the long-pole query latencies when the search node has plenty of spare CPU resources, since the hardest queries will now run concurrently. Furthermore, that extra cost of collecting more hits and merging them in the end is often a minor impact overall since it is usually the matching and ranking of each hit that dominates the total query cost, especially as the index grows larger, and that cost is efficiently split across threads.

You can further "amplify" this tradeoff by limiting how many queries can run concurrently, thereby maximizing how many CPU cores will be used for each query. You could also estimate up front how costly each query will be and execute that query concurrently only if its cost is large enough, so that easy queries that would run quickly with a single thread do not pay the overhead of synchronizing across multiple threads.
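
One way to sketch that first idea is a simple admission-control wrapper around the searcher; the Semaphore-based ThrottledSearcher below is hypothetical application code, not a Lucene API:

    import java.io.IOException;
    import java.util.concurrent.Semaphore;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    // Hypothetical admission control: at most maxConcurrentQueries run at once, so
    // each admitted query gets a larger share of the Executor's threads for its slices.
    public class ThrottledSearcher {
      private final IndexSearcher searcher;  // an IndexSearcher constructed with an Executor
      private final Semaphore permits;

      public ThrottledSearcher(IndexSearcher searcher, int maxConcurrentQueries) {
        this.searcher = searcher;
        this.permits = new Semaphore(maxConcurrentQueries);
      }

      public TopDocs search(Query query, int topN) throws IOException, InterruptedException {
        permits.acquire();
        try {
          return searcher.search(query, topN);
        } finally {
          permits.release();
        }
      }
    }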

This throughput versus latency tradeoff is frustrating, and it means it might make sense to use a modal approach for your Lucene application. When the cluster is lightly loaded, use multiple threads per query by restricting how many queries may run concurrently, reducing long-pole latencies. But when the cluster is running hot, approaching its red-line capacity, shift to a single thread per query, to maximize throughput. Be sure you are measuring latencies correctly and your load testing client does not suffer from the all-too-common coordinated omission bug! Confirm your load-testing client is using open-loop testing so you see the true latency impact from, say, a long garbage collection pause, I/O hiccup or swapping.
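
A sketch of such a modal setup might keep two IndexSearcher instances over the same reader, one with an Executor and one without, and pick between them based on a load signal you maintain yourself; the ModalSearcher class and its runningHot flag below are made up for illustration:

    import java.io.IOException;
    import java.util.concurrent.ExecutorService;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    // Hypothetical modal wrapper: concurrent per-query execution under light load,
    // single-threaded execution when the node is running hot.
    public class ModalSearcher {
      private final IndexSearcher concurrentSearcher;   // uses the Executor for thread slices
      private final IndexSearcher singleThreadSearcher; // searches all segments sequentially
      private volatile boolean runningHot;               // flipped by your own load monitoring

      public ModalSearcher(IndexReader reader, ExecutorService executor) {
        this.concurrentSearcher = new IndexSearcher(reader, executor);
        this.singleThreadSearcher = new IndexSearcher(reader);
      }

      public void setRunningHot(boolean runningHot) {
        this.runningHot = runningHot;
      }

      public TopDocs search(Query query, int topN) throws IOException {
        IndexSearcher searcher = runningHot ? singleThreadSearcher : concurrentSearcher;
        return searcher.search(query, topN);
      }
    }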

Ongoing and future improvements

Fortunately, there have been some recent exciting improvements to reduce the added overhead for multi-threaded queries. Lucene now also uses the incoming (calling) thread to help with concurrent searching. The algorithm for grouping small segments into slices (thread work units) has improved. Early termination now uses a single shared global hit counter across multiple search threads for one query, reducing the total cost for the query. Query caching will soon use the Executor to cache concurrently, and can even be more efficient in some cases when an Executor is used. Instead of each search thread working fully independently and merging top hits only in the end, the threads should share information while they concurrently collect, such as their worst-scoring top hit collected so far, or even use a single shared priority queue across all threads. A single shared priority queue may incur too much locking, so as a compromise, searching now efficiently shares the best of the worst collected hit across searcher threads, which showed impressive luceneutil benchmark results.

These improvements are reducing the extra cost of concurrent search, but that cost can never be zero: there is an inherent natural cost to more frequent thread context switching; lock contention for shared priority queues, hit counters, and priority queue bottoms; and possibly difficult effects due to modern non-uniform memory architectures (NUMA).

One curious and disappointing limitation of Lucene's concurrent search is that a fully merged index, down to a single segment, loses all concurrency! This is Bizarro World, since normally one merges their index down to a single segment in order to improve query performance! But when you are looking at long-pole query latencies, a fully merged index is unfortunately slower since all queries are now single threaded again even when you pass an Executor to IndexSearcher. Even a single large newly completed merge will cause a sawtooth pattern in your long pole latencies as it reduces the net query concurrency, though the red-line cluster throughput capacity still improves with such merges. One simple idea to address this is to allow multiple threads to search a single large segment, which should work well since Lucene has natural APIs for searching separate regions in "docid space" of the segment.

Concurrent search has come a long way since Jean-François Halleux first proposed it for Lucene, and I expect it still has a long way to go, to get to the point where we truly minimize the added overhead of using multiple threads for costly queries. As Lucene improves its query planning and optimization, we will reach a point where easy queries run single-threaded but costly queries run concurrently and efficiently. These improvements must come to Lucene: modern servers continue to add more and more cores, but are not making those cores much faster, so it is inevitable that modern software, including Lucene, must find ways to efficiently tap into all this concurrency.

[I work at Amazon and the postings on this site are my own and do not necessarily represent Amazon's positions]