Lucene's IndexSearcher class, responsible for executing incoming queries to find their top matching hits from your index, accepts an optional Executor (e.g. a thread pool) during construction. If you pass an Executor and your CPUs are idle enough (i.e. your server is well below its red-line QPS throughput capacity), Lucene will use multiple concurrent threads to find the top overall hits for each query.
How does it do that? A Lucene index is segmented, which makes searching it an embarrassingly parallel problem: each query must visit all segments in the index, collecting their globally competitive hits. When the query is single-threaded, because you did not pass an Executor to IndexSearcher, that one query thread must visit all segments sequentially. If the index is large, and your queries are costly, those queries will naturally require high CPU cost and wall clock time to find the top hits. This will cause high long-pole (P90+) query latencies even when you are running the server well below its red-line QPS (throughput) capacity.
Instead, when you do pass an Executor to IndexSearcher, the segments in the index are first grouped up front into single-thread work units called thread slices. By default, large segments belong to their own thread slice and up to 5 smaller segments with at most 250K total documents will be coalesced into a single thread slice, since they are presumably quick to search sequentially by a single thread. You can easily customize how segments are coalesced into thread slices by subclassing IndexSearcher and overriding its protected slices method. Each incoming query is then executed concurrently, as long as the server is idle enough to spend multiple CPU cores on one query, with one thread working on each thread slice for that query.
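Here is a minimal sketch of what this looks like in code, assuming Lucene 8.x-era APIs; the index path, field name, pool size and slice thresholds are made up for illustration, and the slices/LeafSlice signatures have shifted somewhat across Lucene releases:

    import java.nio.file.Paths;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class ConcurrentSearchSketch {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/path/to/index")); // hypothetical path
        DirectoryReader reader = DirectoryReader.open(dir);

        // Dedicated pool for intra-query concurrency; size it to the number of
        // cores you are willing to spend on a single query.
        ExecutorService executor = Executors.newFixedThreadPool(4);

        // Passing the Executor makes IndexSearcher run one thread per thread slice.
        IndexSearcher searcher = new IndexSearcher(reader, executor) {
          @Override
          protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
            // Hypothetical custom policy: coalesce up to 10 small segments or
            // 1M total documents per slice, instead of the default 5 / 250K.
            return slices(leaves, 1_000_000, 10);
          }
        };

        Query query = new TermQuery(new Term("body", "lucene")); // hypothetical field/term
        TopDocs hits = searcher.search(query, 10); // thread slices are searched concurrently

        System.out.println("total hits: " + hits.totalHits);
        reader.close();
        executor.shutdown();
      }
    }

The anonymous subclass is only needed to change the default slicing policy; a plain new IndexSearcher(reader, executor) is enough to enable concurrent search with the defaults.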
This powerful feature was originally proposed almost 16 years ago by Jean-François Halleux and then committed by Doug Cutting himself (hello Doug!) and finally refactored into IndexSearcher almost 9 years ago, and has since undergone a bunch of iterative improvements, many of them unfolding now thanks to Atri Sharma, a recently added Lucene/Solr committer. Such is the distributed power of passionate open-source software development!
Concurrent query execution is a surprisingly little-known sleeper feature in Lucene, since it is not yet exposed in either Elasticsearch or Solr, two popular distributed search applications that build on Lucene. Their concurrency model is instead concurrent search across index shards (usually on different servers) for a single query, using single-threaded search within each shard.
This means many concurrent independent queries are required in order to saturate cluster-wide CPU or IO resources. Until the cluster sees at least that minimum floor QPS, the full hardware resources cannot be utilized. For use cases that often see high query rates, this limitation is acceptable. But other common use cases, with a large index and a lower query rate, would benefit substantially from concurrent query execution within a single cluster node if Elasticsearch or Solr were to use this feature.
The real-world effects of Moore's law have shifted: modern server-class computers are built with amazing and rapidly increasing concurrency in their hardware, not just in their CPUs, where we now see 96 cores in the latest c5.24xlarge AWS EC2 instances, but also in their Graphics Processing Units (GPUs), memory buses and DIMMs, and solid-state disks (SSDs), which are in fact large concurrent RAID 0 arrays under the hood. The recent trend is for CPUs and GPUs to gain more concurrency (cores), and less so for each individual core to get much faster. Why not use all this increasing concurrency to make all queries faster, and saturate CPU/IO even at low query loads?
Tricky Tradeoffs
Unfortunately, even though searching a Lucene index is a naturally and embarrassingly parallel problem, using multiple threads for one query incurs inherent coordination overhead. To understand why, consider a simple analogy: imagine you need apples, so you send your kids to the local grocery store to buy them. If you have only one child, you send her, she walks around the entire produce section and picks the ten best apples, and brings them home.
But if you have five children and you send all of them to the store, will they come back five times faster, ignoring the "networking" time for them to get to and from the store? How do they efficiently split the work?
Perhaps your children are clever and they first split up all the apple sections in the store (there are many diverse apple choices these days!) into five roughly equal sections. Each child runs around her own apple section, picking the ten best apples she can find, and then they all meet up at the checkout counter and work closely to choose the total ten best out of the fifty apples they now have? This is somewhat wasteful, since the children collected fifty apples overall just to choose the actual ten best in the end, but it should indeed be faster than one child picking the ten best overall.
This is effectively how Lucene implements concurrent search today: each searcher thread works alone to find its own top N best hits from one thread slice (the "map" phase), and then, once all query threads have finished and joined back to the main thread, the main thread uses a partial merge sort to find the total top N best hits from the hits collected for each thread slice (the "reduce" phase). Lucene's CollectorManager, Collector and LeafCollector abstractions all work together to implement this. This means more total work is done versus the single-threaded case, since now M * N total hits were collected and then reduced to just the top N in the end, where M is the number of concurrent search threads and N is the requested number of top hits to retrieve.
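To make that map/reduce shape concrete, here is a small hand-rolled CollectorManager sketch (Lucene ships its own collector managers, so this is purely illustrative, and it again assumes Lucene 8.x-era APIs): newCollector is the per-thread "map" side, and reduce merge-sorts the M per-slice top-N lists down to the final top N:

    import java.io.IOException;
    import java.util.Collection;

    import org.apache.lucene.search.CollectorManager;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.TopScoreDocCollector;

    public class MapReduceTopHits {

      /** Collect the top 10 hits: one collector per searcher thread ("map"),
       *  then a partial merge sort of the per-slice hits ("reduce"). */
      static TopDocs searchTopTen(IndexSearcher searcher, Query query) throws IOException {
        final int topN = 10;

        CollectorManager<TopScoreDocCollector, TopDocs> manager =
            new CollectorManager<TopScoreDocCollector, TopDocs>() {
              @Override
              public TopScoreDocCollector newCollector() {
                // Each searcher thread gets its own private collector.
                return TopScoreDocCollector.create(topN, Integer.MAX_VALUE);
              }

              @Override
              public TopDocs reduce(Collection<TopScoreDocCollector> collectors) {
                // M collectors each hold up to N hits; merge M * N down to the top N.
                TopDocs[] perThread = new TopDocs[collectors.size()];
                int i = 0;
                for (TopScoreDocCollector collector : collectors) {
                  perThread[i++] = collector.topDocs();
                }
                return TopDocs.merge(topN, perThread);
              }
            };

        return searcher.search(query, manager);
      }
    }

With an Executor on the IndexSearcher, each thread slice gets its own collector from newCollector; without one, the same code simply runs single-threaded with a single collector.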
That added coordination cost will necessarily hurt the red-line QPS capacity (throughput) of the search node, when running each query concurrently, since Lucene is spending more total CPU cycles finding the top hits. Yet at the same time, it can greatly improve the long-pole query latencies when the search node has plenty of spare CPU resources, since the hardest queries will now run concurrently. Furthermore, that extra cost of collecting more hits and merging them in the end is often a minor impact overall since it is usually the matching and ranking of each hit that dominates the total query cost, especially as the index grows larger, and that cost is efficiently split across threads.
You can further "amplify" this tradeoff by limiting how many queries can run concurrently, thereby maximizing how many CPU cores will be used for each query. You could also estimate up front how costly each query will be and execute that query concurrently only if its cost is large enough, so that easy queries that would run quickly with a single thread do not pay the overhead of synchronizing across multiple threads.
This throughput versus latency tradeoff is frustrating, and it means it might make sense to use a modal approach for your Lucene application. When the cluster is lightly loaded, use multiple threads per query by restricting how many queries may run concurrently, reducing long-pole latencies. But when the cluster is running hot, approaching its red-line capacity, shift to a single thread per query, to maximize throughput. Be sure you are measuring latencies correctly and your load-testing client does not suffer from the all-too-common coordinated omission bug! Confirm your load-testing client is using open-loop testing so you see the true latency impact from, say, a long garbage collection pause, I/O hiccup or swapping.
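One way to wire up such a modal policy is sketched below; the permit count of 2 is a made-up knob, and the two IndexSearcher instances simply wrap the same reader, one with and one without an Executor:

    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Semaphore;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    public class ModalSearcher {
      private final IndexSearcher concurrentSearcher; // searches each query with multiple threads
      private final IndexSearcher serialSearcher;     // classic single-threaded search
      // Hypothetical cap: at most 2 queries may use intra-query concurrency at once.
      private final Semaphore concurrentQueryPermits = new Semaphore(2);

      public ModalSearcher(DirectoryReader reader, ExecutorService executor) {
        this.concurrentSearcher = new IndexSearcher(reader, executor);
        this.serialSearcher = new IndexSearcher(reader);
      }

      public TopDocs search(Query query, int topN) throws IOException {
        if (concurrentQueryPermits.tryAcquire()) {
          // Lightly loaded: spend several cores on this one query to cut its latency.
          try {
            return concurrentSearcher.search(query, topN);
          } finally {
            concurrentQueryPermits.release();
          }
        } else {
          // Running hot: use a single thread per query to protect red-line throughput.
          return serialSearcher.search(query, topN);
        }
      }
    }

Queries that cannot grab a permit fall back to classic single-threaded search, so the executor's threads are never oversubscribed.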
Ongoing and future improvements
Fortunately, there have been some recent exciting improvements to reduce the added overhead for multi-threaded queries. Lucene now also uses the incoming (calling) thread to help with concurrent searching. The algorithm for grouping small segments into slices (thread work units) has improved. Early termination now uses a single shared global hit counter across multiple search threads for one query, reducing the total cost for the query. Query caching will soon use the Executor to cache concurrently and can even be more efficient in some cases when an Executor is used. Instead of each search thread working fully independently and merging top hits only at the end, they should share information while they concurrently collect, such as their worst-scoring top hit collected so far, or even use a single shared priority queue across all threads. The shared priority queue may incur too much locking, so as a compromise, searching now efficiently shares the best of the worst collected hit across searcher threads, which showed impressive luceneutil benchmark results.
These improvements are reducing the extra cost of concurrent search, but that cost can never be zero: there is an inherent natural cost to more frequent thread context switching, lock contention for shared priority queues, hit counters and priority-queue bottoms, and possibly difficult effects due to modern non-uniform memory architectures (NUMA).
One curious and disappointing limitation of Lucene's concurrent search is that a fully merged index, down to a single segment, loses all concurrency! This is Bizarro World, since normally one merges their index down to a single segment in order to improve query performance! But when you are looking at long-pole query latencies, a fully merged index is unfortunately slower, since all queries are now single-threaded again even when you pass an Executor to IndexSearcher. Even a single large newly completed merge will cause a sawtooth pattern in your long-pole latencies, as it reduces the net query concurrency, though the red-line cluster throughput capacity still improves with such merges. One simple idea to address this is to allow multiple threads to search a single large segment, which should work well since Lucene has natural APIs for searching separate regions in "docid space" of the segment.
Concurrent search has come a long way since Jean-François Halleux first proposed it for Lucene, and I expect it still has a long way to go, to get to the point where we truly minimize the added overhead of using multiple threads for costly queries. As Lucene improves its query planning and optimization, we will reach a point where easy queries run single-threaded but costly queries run concurrently and efficiently. These improvements must come to Lucene: modern servers continue to add more and more cores but are not making those cores much faster, so it is inevitable that modern software, including Lucene, must find ways to efficiently tap into all this concurrency.
[I work at Amazon and the postings on this site are my own and do not necessarily represent Amazon's positions]