Monday, January 4, 2021

Apache Lucene performance on 128-core AMD Ryzen Threadripper 3990X

Almost a decade ago, I started running Lucene's nightly benchmarks, and have been trying with mixed success to keep them running every night, through the numerous amazing changes relentlessly developed by the passionate Lucene community. The benchmarks run on the tip of Lucene's mainline branch each night, which is understandably a volatile and high velocity code base.

Sure, Lucene's wonderful randomized unit tests will catch an accidental bug, API breakage or perhaps a subtle corner-case issue during development. But nothing otherwise catches all-too-easy unexpected performance regressions, nor helps us measure performance gains when we optimize.

As a recent example, it looks like upgrading from JDK 12 to JDK 15 might have hurt Lucene's Month faceting queries/sec by ~5% (look for annotation DG in that chart). However, that was not the only change in that period, benchmarks failed to run for a few nights, and other tasks don't seem to show such a drop, so it's possible (likely?) there is another root cause. Such is the challenge of benchmarking!  WTFs suddenly sprout up all the time.  

Time flies when you are having fun: it has been almost five years since I last upgraded the custom hardware that runs Lucene's nightly benchmarks, nearly an eternity in computer-years! Thanks to the fast paced technology market, computers keep getting relentlessly bigger, smaller, faster and cheaper.

So, finally, as of a couple months ago, November 6, 2020, I have switched our nightly benchmarks to a new custom-built workstation, creatively named beast3, with these parts:

  • Single socket AMD Ryzen Threadripper "desktop class" 3990X (64 cores, 128 with hyperthreading), clocked/volted at defaults
  • 256 GB quad channel Multi-Bit ECC DDR 4 RAM, to reduce the chance of errant confusing bit flips possibly wasting precious developer time (plus Linus agrees!)
  • Intel Optane SSD 905P Series, 960GB
  • RAID 1 array (mirror) of NVMe Samsung 970 pro 1 TB SSDs
  • A spinning-magnets 16 TB Seagate IronWolf Pro
  • Arch Linux, kernel 5.9.8-arch1-1
  • OpenJDK 15.0.1+9-18
All Lucene benchmarks use the Optane SSD to store their Lucene indices, though it is likely unimportant since the 256 GB of RAM ensures the indices are nearly entirely hot. All source documents are pulled from the RAID 1 SSD mirror to ensure reading the source documents is very fast and will not conflict with writing the Lucene indices.

beast2 was an impressive workstation five years ago, with dual socket Intel Xeon E5-2699 v3 "server class" CPUs, but this new workstation, now using a lower class "desktop class" CPU, in a single socket, is a even faster.

Watching top while running gradle test configured to use 64 JVMs is truly astounding. At times my whole terminal window is filled with only java! But, this also reveals the overall poor concurrency of Lucene's gradle/test-framework compiling and executing our numerous unit tests on highly concurrent hardware. Compilation of all main and test sources takes minutes and looks nearly single-threaded, with a single java process taking ~100% CPU. Most of the time my terminal is NOT full of java processes, and overall load is well below what the hardware could achieve. Patches welcome!

The gains across our various benchmarks are impressive:

Most of these tasks are by design effectively testing single-core performance, showing each core of the new CPU is also substantially faster than one core of the older Xeon. The exceptions are Indexing, Primary Key Lookup and Time to run all Lucene unit tests, which do effectively use multiple cores.

I am happy to see the sizable jump in Lucene's indexing throughput, despite not yet increasing the number of indexing threads (still 36): it shows that Lucene's indexing implementation is indeed quite concurrent, allowing the faster cores to index more efficiently. However, smaller ~1 KB documents saw less gains than larger ~4 KB documents, likely due to some sort of locking contention in IndexWriter that is relatively more costly with smaller documents. Patches welcome!

The only serious wrinkle with upgrading to this new box is that rarely, a java process will simply hang, forever, until I notice, jstack and kill -9 it. I have opened this issue to try to get to the bottom of it.  It may be yet another classloader deadlock bug.

Another small challenge is this is my first custom liquid cooling loop, and I am surprised how quickly (relatively speaking) the coolant "evaporates" despite being a closed loop with no obvious leaks. I just must remember to add more coolant periodically, or else the CPU might start thermal throttling and make everything go slowly!

Sunday, October 6, 2019

Concurrent query execution in Apache Lucene

Apache Lucene is a wonderfully concurrent pure Java search engine, easily able to saturate the available CPU or IO resources on your server, if you ask it to. The concurrency model for a "typical" Lucene application is one thread per query at search time, but did you know Lucene can also execute a single query concurrently using multiple threads to greatly reduce how long your slowest queries take?

Lucene's IndexSearcher class, responsible for executing incoming queries to find their top matching hits from your index, accepts an optional Executor (e.g. a thread pool) during construction. If you pass an Executor and your CPUs are idle enough (i.e. your server is well below its red-line QPS throughput capacity), Lucene will use multiple concurrent threads to find the top overall hits for each query.

How does it do that? A Lucene index is segmented, which makes searching it an embarassingly parallel problem: each query must visit all segments in the index, collecting their globally competitive hits. When the query is single-threaded, because you did not pass an Executor to IndexSearcher, that one query thread must visit all segments sequentially. If the index is large, and your queries are costly, those queries will naturally require high CPU cost and wall clock time to find the top hits. This will cause high long-pole (P90+) query latencies even when you are running the server well below its red-line QPS (throughput) capacity.

Instead, when you do pass an Executor to IndexSearcher, the segments in the index are first grouped up front into single thread work units called thread slices. By default, large segments belong to their own thread slice and up to 5 smaller segments with at most 250K total documents will be coalesced into a single thread slice, since they are presumably quick to search sequentially by a single thread. You can easily customize how segments are coalesced into thread slices by subclassing IndexSearcher and overriding its protected slices method. Each incoming query is then executed concurrently, as long as the server is idle enough to spend multiple CPU cores on one query, with one thread working on each thread slice for that query.

This powerful feature was originally proposed almost 16 years ago by Jean-Fran├žois Halleux and then committed by Doug Cutting himself (hello Doug!) and finally refactored into IndexSearcher almost 9 years ago, and has since undergone a bunch of iterative improvements, many unfolding now thanks to Atri Sharma, recently added new Lucene/Solr committer. Such is the distributed power of passionate open-source software development!

Concurrent query execution is a surprisingly little known sleeper feature in Lucene, since it is not yet exposed in Elasticsearch nor Solr, two popular distributed search applications that build on Lucene. Their concurrency model is instead concurrent search across index shards (usually on different servers) for a single query, but using single-threaded search within each shard.

This means many concurrent independent queries are required in order to saturate cluster wide CPU or IO resources. Until the cluster sees at least that minimum floor QPS, the full hardware resources cannot be utilized. For use cases that often see high query rates, this limitation is acceptable. But other common use-cases that have a large index and lower query rate would benefit substantially from concurrent query execution within a single cluster node if Elasticsearch or Solr were to use this feature.

The real-world effects of Moore's law have shifted: modern server class computers are built with amazing and rapidly increasingly concurrent hardware, not just in their CPUs where we now see 96 cores in the latest c5.24xlarge AWS EC2 instances, but also in their Graphic Processing Units (GPUs), memory bus and DIMMs and solid-state disks (SSDs), which are in fact large concurrent RAID 0 arrays under-the-hood. The recent trend is for CPUs and GPUs to gain more concurrency (cores), and less so for each individual core to get too much faster. Why not use all this increasing concurrency to make all queries faster, and saturate CPU/IO even at low query loads?

Tricky Tradeoffs

Unfortunately, even though searching a Lucene index is a naturally and embarrassingly parallel problem, using multiple threads for one query incurs inherent coordination overhead. To understand why, consider a simple analogy: imagine you need apples, so you send your kids to the local grocery store to buy them. If you have one only child, you send her, she walks around the entire produce section and picks the ten best apples, and brings them home.

But if you have five children and you send all of them to the store, will they come back five times faster, ignoring the "networking" time for them to get to and from the store? How do they efficiently split the work?

Perhaps your children are clever and they first split up all the apple sections in the store (there are many diverse apple choices these days!) into five roughly equal sections. Each runs around their own apple section, picking the ten best apples she can find, and then they all meet up at the checkout counter and work closely to choose the total ten best out of the fifty apples they now have? This is somewhat wasteful, since the children collected fifty apples overall just to choose the actual ten best in the end, but it should indeed be faster than one child picking the ten best overall.

This is effectively how Lucene implements concurrent search today: each searcher thread works alone to find its own top N best hits from one thread slice (the "map" phase), and then, once all query threads have finished and joined back to the main thread, the main thread uses a partial merge sort to find the total top N best hits from the hits collected for each thread slice (the "reduce" phase). Lucene's CollectorManager, Collector and LeafCollector abstractions all work together to implement this. This means more total work is done versus the single threaded case, since now M * N total hits were collected and then reduced to just the top N in the end, where M is the number of concurrent search threads and N is the requested number of top hits to retrieve.

That added coordination cost will necessarily hurt the red-line QPS capacity (throughput) of the search node, when running each query concurrently, since Lucene is spending more total CPU cycles finding the top hits. Yet at the same time, it can greatly improve the long-pole query latencies when the search node has plenty of spare CPU resources, since the hardest queries will now run concurrently. Furthermore, that extra cost of collecting more hits and merging them in the end is often a minor impact overall since it is usually the matching and ranking of each hit that dominates the total query cost, especially as the index grows larger, and that cost is efficiently split across threads.

You can further "amplify" this tradeoff by limiting how many queries can run concurrently, thereby maximizing how many CPU cores will be used for each query. You could also estimate up front how costly each query will be and execute that query concurrently only if its cost is large enough, so that easy queries that would run quickly with a single thread do not pay the overhead of synchronizing across multiple threads.

This throughput versus latency tradeoff is frustrating, and it means it might make sense to use a modal approach for your Lucene application. When the cluster is lightly loaded, use multiple threads per query by restricting how many queries may run concurrently, reducing long-pole latencies. But when the cluster is running hot, approaching its red-line capacity, shift to a single thread per query, to maximize throughput. Be sure you are measuring latencies correctly and your load testing client does not suffer from the all-too-common coordinated omission bug! Confirm your load-testing client is using open-loop testing so you see the true latency impact from, say, a long garbage collection pause, I/O hiccup or swapping.

Ongoing and future improvements

Fortunately, there have been some recent exciting improvements to reduce the added overhead for multi-threaded queries. Lucene now also uses the incoming (calling) thread to help with concurrent searching. The algorithm for grouping small segments into slices (thread work units) has improved. Early termination now uses a single shared global hit counter across multiple search threads for one query, reducing the total cost for the query. Query caching will soon use the Executor to cache concurrently and can even be more efficient in some cases when an Executor is used. Instead of each search thread working fully independently and merging top hits only in the end, they should share information while they concurrently collect such as their worst scoring top hit collected so far or even use a single shared priority queue across all threads. The shared priority queue may incur too much locking, so as a compromise, searching now efficiently shares the best of the worst collected hit across searcher threads, which showed impressive luceneutil benchmark results.

These improvements are reducing the extra cost of concurrent search but that cost can never be zero as there is an inherent natural cost to more frequent thread context switching, lock contention for shared priority queues, hit counters and priority queue bottoms and possibly difficult effects due to modern non-uniform memory architectures (NUMA).

One curious and disappointing limitation of Lucene's concurrent search is that a fully merged index, down to a single segment, loses all concurrency! This is Bizarro World, since normally one merges their index down to a single segment in order to improve query performance! But when you are looking at long-pole query latencies, a fully merged index is unfortunately slower since all queries are now single threaded again even when you pass an Executor to IndexSearcher. Even a single large newly completed merge will cause a sawtooth pattern in your long pole latencies as it reduces the net query concurrency, though the red-line cluster throughput capacity still improves with such merges. One simple idea to address this is to allow multiple threads to search a single large segment, which should work well since Lucene has natural APIs for searching separate regions in "docid space" of the segment.

Concurrent search has come a long way since Jean-Fran├žois Halleux first proposed it for Lucene, and I expect it still has a long way to go, to get the point where we truly minimize the added overhead of using multiple threads for costly queries. As Lucene improves its query planning and optimization we will reach a point where easy queries run single-threaded but costly queries run concurrently and efficiently.  These improvements must come to Lucene: modern servers continue to add more and more cores but are not making those cores too much faster, so it is inevitable that modern software, including Lucene, must find ways to efficiently tap into all this concurrency.

[I work at Amazon and the postings on this site are my own and do not necessarily represent Amazon's positions]

Monday, September 4, 2017

Lucene's near-real-time segment index replication

[TL;DR: Apache Lucene 6.0 quietly introduced a powerful new feature called near-real-time (NRT) segment replication, for efficiently and reliably replicating indices from one server to another, and taking advantage of ever faster and cheaper local area networking technologies. Neither of the popular search servers (Elasticsearch, Solr) are using it yet, but it should bring a big increase in indexing and searching performance and robustness to them.]

Lucene has a unique write-once segmented architecture: recently indexed documents are written to a new self-contained segment, in append-only, write-once fashion: once written, those segment files will never again change. This happens either when too much RAM is being used to hold recently indexed documents, or when you ask Lucene to refresh your searcher so you can search all recently indexed documents.

Over time, smaller segments are merged away into bigger segments, and the index has a logarithmic "staircase" structure of active segment files at any time. This is an unusual design, when compared with databases which continuously update their files in-place, and it bubbles up to all sorts of nice high-level features in Lucene. For example:
  • Efficient ACID transactions.

  • Point-in-time view of the index for searching that will never change, even under concurrent indexing, enabling stable user interactions like pagination and drill-down.

  • Multiple point-in-time snapshots (commit points) can be indefinitely preserved in the index, even under concurrent indexing, useful for taking hot backups of your Lucene index.
The ZFS filesystem's has similar features, such as efficient whole-filesystem snapshots, which are possible because it also uses a write-once design, at the file-block level: when you change a file in your favorite text editor and save it, ZFS allocates new blocks and writes your new version to those blocks, leaving the original blocks unchanged.

This write-once design also empowers Lucene to use optimized data-structures and apply powerful compression techniques when writing index files, because Lucene knows the values will not later change for this segment. For example, you never have to tell Lucene that your doc values field needs only 1 byte of storage, nor that the values are sparse, etc.: Lucene figures that out by itself by looking at all values it is about to write to each new segment.

Finally, and the topic of this post: this design enables Lucene to efficiently replicate a search index from one server ("primary") to another ("replica"): in order to sync recent index changes from primary to replica you only need to look at the file names in the index directory, and not their contents. Any new files must be copied, and any files previously copied do not need to be copied again because they are never changed!

Taking advantage of this, we long ago added a replicator module to Lucene that used exactly this approach, and it works well. However, to use those APIs to replicate recent index changes, each time you would like to sync, you must first commit your changes on the primary index. This is unfortunately a costly operation, invoking fsync on multiple recently written files, and greatly increases the sync latency when compared to a local index opening a new NRT searcher.

Document or segment replication?

The requirement to commit in order to replicate Lucene's recent index changes was such a nasty limitation that when the popular distributed search servers Elasticsearch and Solr added distributed indexing support, they chose not to use Lucene's replication module at all, and instead created their own document replication, where the primary and all replicas redundantly index and merge all incoming documents.

While this might seem like a natural approach to keeping replicas in sync, there are downsides:
  • More costly CPU/IO resource usage across the cluster: all nodes must do the same redundant indexing and merging work, instead of just one primary node. This is especially painful when large merges are running and interfere with concurrent searching, and is more costly on clusters that need many replicas to support high query rates. This alone would give Elasticsearch and Solr a big increase in cluster wide indexing and searching throughput.

  • Risk of inconsistency: ensuring that precisely the same set of documents is indexed in primary and all replicas is tricky and contributed to the problems Elasticsearch has had in the past losing documents when the network is mis-behaving. For example, if one of the replicas throws an exception while indexing a document, or if a network hiccup happens, that replica is now missing a document that the other replicas and primary contain.

  • Costly node recovery after down time: when a replica goes down for a while and then comes back up, it must replay (reindex) any newly indexed documents that arrived while it was down. This can easily be a very large number, requiring a large transaction log and taking a long time to catch up, or it must fallback to a costly full index copy. In contrast, segment based replication only needs to copy the new index files.

  • High code complexity: the code to handle the numerous possible cases where replicas can become out of sync quickly becomes complex, and handling a primary switch (because the old primary crashed) especially so.

  • No "point in time" consistency: the primary and replicas refresh on their own schedules, so they are all typically searching slightly different and incomparable views of the index.
Finally, in Lucene 6.0, overshadowed by other important features like dimensional points, we quietly improved Lucene's replication module with support for NRT segment replication, to copy new segment files after a refresh without first having to call commit. This feature is an especially compelling when combined with the neverending trend towards faster and cheaper local area networking technologies.

How does it work?

While the logical design is straightforward ("just copy over the new segment files from primary to replicas"), the implementation is challenging because this adds yet another concurrent operation (replicas slowly copying index files over the wire), along side the many operations that already happen in IndexWriter such as opening new readers, indexing, deleting, segment merging and committing. Fortunately we were able to build on pre-existing Lucene capabilities like NRT readers, to write new segments and identify all their files, and snapshotting, to ensure segment files are not deleted until replicas have finished copying them.

The two main APIs are PrimaryNode, which holds all state for the primary node including a local IndexWriter instance, and ReplicaNode for the replicas. The replicas act like an index writer, since they also create and delete index files, and so they acquire Lucene's index write lock when they start, to detect inadvertent misuse. When you instantiate each replica, you provide a pointer to where its corresponding primary is running, for example a host or IP address and port.

Both the primary and replica nodes expose a SearcherManager so you can acquire and release the latest searcher at any time. Searching on the primary node also works, and would match the behavior of all Elasticsearch and Solr nodes today (since they always do both indexing and searching), but you might also choose to dedicate the primary node to indexing only.

You use IndexWriter from the primary node to make index changes as usual, and then when you would like to search recently indexed documents, you ask the primary node to refresh. Under the hood, Lucene will open a new NRT reader from its local IndexWriter, gather the index files it references, and notify all connected replicas. The replicas then compute which index files they are missing and then copy them over from the primary.

Document deletions, which normally carry in memory as a bitset directly from IndexWriter to an NRT IndexReader, are instead written through to the filesystem and copied over as well. All files are first copied to temporary files, and then renamed (atomically) in the end if all copies were successful. Lucene's existing end-to-end checksums are used to validate no bits were flipped in transit by a flaky network link, or bad RAM or CPU. Finally, the in-memory segments file (a SegmentInfos instance) is serialized on the wire and sent to the replicas, which then deserialize it and open an NRT searcher via the local SearcherManager. The resulting searcher on the replica is guaranteed to search the exact same point-in-time view as the primary.

This all happens concurrently with ongoing searches on the replica, and optionally primary, nodes. Those searches see the old searcher until the replication finishes and the new searcher is opened. You can also use Lucene's existing SearcherLifetimeManager, which tracks each point-in-time searcher using its long version, if you need to keep older searchers around for a while as well.

The replica and primary nodes both expose independent commit APIs; you can choose to call these based on your durability requirements, and even stagger the commits across nodes to reduce cluster wide search capacity impact.

No transaction log

Note that Lucene does not provide a transaction log! More generally, a cluster of primary + replicas linked by NRT replication behave a lot like a single IndexWriter and NRT reader on a single JVM. This means it is your responsibility to be prepared to replay documents for reindexing since the last commit point, if the whole cluster crashes and starts up again, or if the primary node crashes before replicas were able to copy the new point-in-time refresh.

Note that at the filesystem level, there is no difference between the Lucene index for a primary versus replica node, which makes it simple to shut down the old primary and promote one of the replicas to be a new primary.


With NRT replication, the primary node also does all segment merging. This is important, because merging is a CPU and IO heavy operation, and interferes with ongoing searching.

Once the primary has finished a merge, and before it installs the merged segment in its index, it uses Lucene's merged segment warming API to give all replicas a chance to pre-copy the merged segment. This means that a merge should never block a refresh, so Lucene will keep fast refreshes even as large merged segments are still copying. Once all running replicas have pre-copied the merge, then the primary installs the merged segment, and after the next refresh, so do all the replicas.

We have discussed having replicas perform their own merging, but I suspect that will be a poor tradeoff. Local area networks (e.g. 10 gigabit ethernet) are quickly becoming plenty fast and cheap, and asking a replica to also do merging would necessarily impact search performance. It would also be quite complex trying to ensure replicas perform precisely the same merges as the primary at the same time, and would otherwise break the same point-in-time view across replicas.


The primary and replica nodes are abstract: you must implement certain functions yourself. For example, the low level mechanics of how to copy bytes from primary to replica is up to you. Lucene does not provide that, except in its unit tests which use simple point-to-point TCP "thread per connection" servers. You could choose to use rsync, robocopy, netty servers, a central file server, carrier pigeons, UDP multicast (likely helpful if there are many replicas on the same subnet), etc.

Lucene also does not provide any distributed leader election algorithms to pick a new primary when the current primary has crashed, nor heuristics to detect a downed primary or replica. But once you pick the new primary, Lucene will take care of having all replicas cutover to it, removing stale partially copied files from the old primary, etc.

Finally, Lucene does not provide any load-balancing to direct queries to the least loaded replica, nor any cluster state to keep track of which node is the primary and which are the replicas, though Apache Zookeeper is useful for such shared distributed state.  These parts are all up to you!

Expected failure modes

There are many things that can go wrong with a cluster of servers indexing and searching with NRT index replication, and we wrote an evil randomized stress test case to exercise Lucene's handling in such cases. This test creates a cluster by spawning JVM subprocesses for a primary and multiple replica nodes, and begins indexing and replicating, while randomly applying conditions like an unusually slow network to specific nodes, a network that randomly flips bits, random JVM crashes (SIGSEGV!) of either primary or replica nodes, followed by recovery, etc. This test case uncovered all sorts of fun corner cases!

An especially important case is when the primary node goes down (crashes, loses power or is intentionally killed). In this case, one of the replicas, ideally the replica that was furthest along in copying files from the old primary as decided by a distributed election, is promoted to become the new primary node. All other replicas then switch to the new primary, and in the process must delete any files copied or partially copied from the old primary but not referenced by the new primary. Any documents indexed into the primary but not copied to that replica will need to be indexed again, and this is the caller's responsibility (no transaction log).

If the whole cluster crashes or loses power, on restart you need to determine whichever index (primary or replica) has the "most recent" commit point, and start the primary node on that host and replica nodes on the other hosts. Those replica nodes may need to delete some index files in order to switch to the new primary, and Lucene takes care of that. Finally, you will have to replay any documents that arrived after the last successful commit.

Other fun cases include: a replica crashing while it was still pre-copying a merge; a replica that is falling behind because it is still copying files from the previous refresh when a new refresh happens; a replica going down and coming back up later after the primary node has also changed.


There are also some minor downsides to segment replication:
  • Very new code: this feature is quite new and also quite complex, and not widely used yet. Likely there exciting bugs! Patches welcome!

  • Slightly slower refresh time: after refreshing on the primary, including writing document deletions to disk as well, we must then copy all new files to the replica and open a new searcher on the replica, adding a bit more time before documents are visible for searching on the replica when compared to a straight NRT reader from an IndexWriter.  If this is really a problem, you can use the primary node for searching and have refresh latency very close to what a straight NRT reader provides.

  • Index problems might be replicated: if something goes wrong, and the primary somehow writes a broken index file, then that broken file will be replicated to all replicas too. But this is uncommon these days, especially with Lucene's end to end checksums.


NRT segment replication represents an opportunity for sizable performance and reliability improvements to the popular distributed search servers, especially when combined with the ongoing trend towards faster and cheaper local area networks.  While this feature unfortunately came too late for Elasticsearch and Solr, I am hopeful that the next popular distributed search server, and its users, can benefit from it!

[I work at Amazon and the postings on this site are my own and don't necessarily represent Amazon's position]

Sunday, July 2, 2017

Lucene gets concurrent deletes and updates!

Long ago, Lucene could only use a single thread to write new segments to disk. The actual indexing of documents, which is the costly process of inverting incoming documents into in-memory segment data structures, could run with multiple threads, but back then, the process of writing those in-memory indices to Lucene segments was single threaded.

We fixed that, more than 6 years ago now, yielding big indexing throughput gains on concurrent hardware.

Today, hardware has only become even more concurrent, and we've finally done the same thing for processing deleted documents and updating doc values!

This change, in time for Lucene's next major release (7.0), shows a 53% indexing throughput speedup when updating whole documents, and a 7.4X - 8.6X speedup when updating doc values, on a private test corpus using highly concurrent hardware (an i3.16xlarge EC2 instance).

Buffering versus applying

When you ask Lucene's IndexWriter to delete a document, or update a document (which is an atomic delete and then add), or to update a doc-values field for a document, you pass it a Term, typically against a primary key field like id, that identifies which document to update. But IndexWriter does not perform the deletion right away. Instead, it buffers up all such deletions and updates, and only finally applies them in bulk once they are using too much RAM, or you refresh your near-real-time reader, or call commit, or a merge needs to kick off.

The process of resolving those terms to actual Lucene document ids is quite costly as Lucene must visit all segments and perform a primary key lookup for each term. Performing lookups in batches gains some efficiency because we sort the terms in unicode order so we can do a single sequential scan through each segment's terms dictionary and postings.

We have also optimized primary key lookups and the buffering of deletes and updates quite a bit over time, with issues like LUCENE-6161, LUCENE-2897, LUCENE-2680, LUCENE-3342. Our fast BlockTree terms dictionary can sometimes save a disk seek for each segment if it can tell from the finite state transducer terms index that the requested term cannot possibly exist in this segment.

Still, as fast as we have made this code, only one thread is allowed to run it at a time, and for update-heavy workloads, that one thread can become a major bottleneck. We've seen users asking about this in the past, because while the deletes are being resolved it looks as if IndexWriter is hung since nothing else is happening. The larger your indexing buffer the longer the hang.

Of course, if you are simply appending new documents to your Lucene index, never updating previously indexed documents, a common use-case these days with the broad adoption of Lucene for log analytics, then none of this matters to you!

Concurrency is hard

With this change, IndexWriter still buffers deletes and updates into packets, but whereas before, when each packet was also buffered for later single-threaded application, instead IndexWriter now immediately resolves the deletes and updates in that packet to the affected documents using the current indexing thread. So you gain as much concurrency as indexing threads you are sending through IndexWriter.

The change was overly difficult because of IndexWriter's terribly complex concurrency, a technical debt I am now convinced we need to address head-on by somehow refactoring IndexWriter. This class is challenging to implement since it must handle so many complex and costly concurrent operations: ongoing indexing, deletes and updates; refreshing new readers; writing new segment files; committing changes to disk; merging segments and adding indexes. There are numerous locks, not just IndexWriter's monitor lock, but also many other internal classes, that make it easy to accidentally trigger a deadlock today. Patches welcome!

The original change also led to some cryptic test failures thanks to our extensive randomized tests, which we are working through for 7.0.

That complex concurrency unfortunately prevented me from making the final step of deletes and updates fully concurent: writing the new segment files. This file writing takes the in-memory resolved doc ids and writes a new per-segment bitset, for deleted documents, or a whole new doc values column per field, for doc values updates.

This is typically a fast operation, except for large indices where a whole column of doc-values updates could be sizable. But since we must do this for every segment that has affected documents, doing this single threaded is definitely still a small bottleneck, so it would be nice, once we succeed in simplifying IndexWriter's concurrency, to also make our file writes concurrent.

[I work at Amazon and the postings on this site are my own and don't necessarily represent Amazon's position]

Tuesday, March 14, 2017

Apache Lucene 7.0 Is Coming Soon!

The Apache Lucene project will likely release its next major release, 7.0, in a few months!

Remember that Lucene developers generally try hard to backport new features for the next non-major (feature) release, and the upcoming 6.5 already has many great changes, so a new major release is exciting because it means the 7.0-only features, which I now describe, are the particularly big ones that we felt could not be backported for 6.5.

Of course, with every major release, we also do more mundane things like remove deprecated 6.x APIs, and drop support for old indices (written with Lucene 5.x or earlier).

This is only a subset of the new 7.0 only features; for the full list please see the 7.0.0 section in the upcoming CHANGES.txt.

Doc values as iterators

The biggest change in 7.0 is changing doc values from a random access API to a more restrictive iterator API.

Doc values are Lucene's column-stride numeric, sorted or binary per-document field storage across all documents. They can be used to hold scoring signals, such as the single-byte (by default) document length encoding or application-dependent signals, or for sorting, faceting or grouping, or even numeric fields that you might use for range filtering in some queries. Their column-stride storage means it's efficient to visit all values for the one field across documents, in contrast to row-stride storage that stored fields use to retrieve all field values for a single document.

Postings have long been consumed through an iterator, so this was a relatively natural change to make, and the two share the same base class, DocIdSetIterator, to step through or seek to each hit.

The initial rote switch to an iterator API was really just a plumbing swap and less interesting than all the subsequent user-impacting improvements that became possible thanks to the more restrictive API:
With these changes, you finally only pay for what you actually use with doc values, in index size, indexing performance, etc. This is the same as other parts of the index like postings, stored fields, term vectors, etc., and it means users with very sparse doc values no longer see merges taking unreasonably long time or the index becoming unexpectedly huge while merging.

Our nightly sparse benchmarks, based on the NYC Trip Data corpus, show the impressive gains each of the above changes (and more!) accomplished.

Goodbye index-time boosts

Index-time boosting, which lets you increase the a-priori score for a particular document versus other documents, is now deprecated and will be removed in 7.0.

This has always been a fragile feature: it was encoded, along with the field's length, into a single byte value, and thus had very low precision. Furthermore, it is now straightforward to write your custom boost into your own doc values field and use function queries to apply the boost at search time. Finally, with index time boosts gone, length encoding is more accurate, and in particular the first nine length values (1 to 9) are distinct.

Query scoring is simpler

BooleanQuery has long exposed a confusing scoring feature called the coordination factor (coord), to reward hits containing a higher percentage of the search terms. However, this hack is only necessary for scoring models like TF/IDF which have "weak" term saturation such that many occurrences of a single term in a document would be more powerful than adding a single occurence of another term from the query. Since this was specific to one scoring model, TFIDFSimilarity, and since Lucene has now switched to the better Okapi BM25 scoring model by default, we have now fully removed coordination factors in 7.0 from both BooleanQuery and Similarity.

Likewise, the query normalization phase of scoring will be removed. This phase tried to equalize scores across different queries and indices so that they are more comparable, but didn't alter the sort order of hits, and was also TF/IDF specific.

With these scoring simplifications BooleanQuery now makes more aggressive query optimizations when the same sub-clause occurs with different Occur constraints, previously not possible since the scores would change.

Classic Query Parser no longer splits on whitespace

Lucene's original, now called "classic", query parser, always pre-splits the incoming query text on whitespace, and then separately sends those single tokens to the query-time analyzer. This means multi-token filters, such as SynonymGraphFilter or ShingleFilter, will not work.

For example, if the user asks for "denial of service attack" and you had a synonym mapping "denial of service" to DOS, the classic query parser would separately analyze "denial", "of" and "service" so your synonym would never match.

We have already added an option to the query parser to not pre-split on whitespace but left the default unchanged for 6.x releases to preserve backwards compatibility. Finally, with 7.0, we fix that default so that analyzers can see multiple tokens at once, and synonyms will work.

More stuff

As of 7.0 Lucene will (finally!) record into the index metadata which Lucene version was used to originally create it. This knowledge can help us implement future backwards compatibility.

Finite state transducers, used in many ways in Lucene, used to have a complex method call pack which would eek out a few more bytes to further shrink the already small size of the FST. But the code was complex and rarely used and sometimes even made the FST larger so we have removed it for 7.0.

IndexWriter, used to add, update and delete documents in your index, will no longer accept broken token offsets sometimes produced by mis-behaving token filters. Offsets are used for highlighting, and broken offsets, where the end offset for a single token comes before the start offset, or the start offset of a token goes backwards versus the previous token, can only break search-time highlighting. So with this change, Lucene prevents such mistakes at index time by throwing an exception. To ease this transition for cases where users didn't even know their analyzer was producing broken offsets, we've also added a few token filters to "correct" offsets before they are passed to IndexWriter.

Advanced users of Lucene often need to cache something custom for each segment at search time, but the APIs for this are trappy and can lead to unexpected memory leaks so we have overhauled these APIs to reduce the chance of accidental misuse.

Finally, the dimensional points API now takes a field name up front to offer per-field points access, matching how the doc values APIs work.

Lucene 7.0 has not been released, so if you have ideas on any additional major-release-worthy changes you'd like to explore please reach out!

[I work at Amazon and the postings on this site are my own and don't necessarily represent Amazon's position]

Thursday, October 20, 2016

Jirasearch 2.0 dog food: using Lucene to find our Jira issues

A few years ago I first built and released Jirasearch as a fun dog-food test case for the thin-wrapper Lucene server, to expose a powerful search UI over our Jira issues.

This is a great showcase of a number of Lucene's important features:

Curiously, spell correction, or even fuzzy infix suggestions, is still missing (pull requests welcome!).

Since the initial release of Jirasearch it has seen substantial usage and interest from users and developers. Building this and keeping it running all this time has been an awesome and humbling exercise for me because I get to experience life as a "production" user of our software. At the same time, we all get a nice search UI for finding issues.

Upgrading from Lucene 4.6.x to 6.x

For the past week or so I had another similarly humbling experience, this time upgrading Jirasearch from the very-old Lucene 4.6.x release, to the latest 6.x release. Small (yet vital!) things changed, such as the new requirement to use a special index searcher with ToParentBlockJoinQuery, which conflicts with how you must use DrillSideways. I hit this bug in the infix suggester. Something changed about pure negative boolean queries, but I am still not sure what (I have worked around it for now)!

I had already previously upgraded Lucene server to dimensional points so I got that "for free" for the existing numeric fields in Jirasearch.

New Jirasearch features

Besides "merely" upgrading from Lucene 4.6.x to 6.x, and switching all numeric fields to the new dimensional points, I also added some compelling user-visible improvements (thank you to Alexandre Rafalovitch for suggesting some of these, thus kick-starting my unexpectedly challenging upgrade-and-improve effort):

  • is finally presented as Doug Cutting! Plus, the auto-suggest now works if you type "Doug".
  • The new Updated ago facet dimension lets you drill down to issues that have not been updated for some time.
  • The new Last comment user facet dimension is the user who last commented on an issue.
  • The new Committed by facet dimension lets you drill down to those issues a given developer has committed changes for.
  • The Committed paths hierarchical facet dimension, letting you find issues according to which paths in the source tree were changed for that issue, was broken since we switched from Subversion to Git.
  • The Infrastructure project issues are now included as well.
  • The per-comment text processing sees some minor improvements, e.g. expanding a referenced user name to their display name, mapping commitbot comment link directly to the change set and including the branch name, plus a few new synonyms (try pnp!)

The new facet fields are especially fun: you can now find issues that you perhaps killed, by drilling down on Updated ago > 1 month ago and Last comment user = you (this was the use case suggested by Alexandre).

Another fun one is to see issues a given developer committed (Committed by) to an unusual part of the source tree (Committed paths), e.g. the issues where I committed changes to Solr for a Lucene Jira issue.

Open source Jirasearch

With this update I am also making all the sources behind jirasearch open-source under the Apache 2 license, in the examples/jirasearch sub-directory of the luceneserver github project.

While Luceneserver itself is entirely Java, the sources for the Jirasearch application, to extract details of all issues from the Apache Jira instance, to convert those documents into Lucene server documents, to do a full and near-real-time indexing, building suggestest, and the search UI, are entirely Python.

Please note the Python sources are not particularly pretty. Yet, they are functional, and as always: patches welcome!

It's likely I broke things during this upgrade process; please let me know (add a comment here, or shoot me an email) if so.

Sunday, October 25, 2015

Where are my new blog posts?

Some of you have noticed that I'm not writing much in this blog lately.

But fear not: exciting changes are still happening in Lucene, and I am still writing about them!

It's just that most of what I write is now appearing at either the Elastic blogs or on my Google+ feed, so please head over to those two sources to keep reading about the fun changes in Apache Lucene and Elasticsearch.

Monday, November 17, 2014

Apache Lucene™ 5.0.0 is coming!

At long last, after a strong series of 4.x feature releases, most recently 4.10.2, we are finally working towards another major Apache Lucene release!

There are no promises for the exact timing (it's done when it's done!), but we already have a volunteer release manager (thank you Anshum!).

A major release in Lucene means all deprecated APIs (as of 4.10.x) are dropped, support for 3.x indices is removed while the numerous 4.x index formats are still supported for index backwards compatibility, and the 4.10.x branch becomes our bug-fix only release series (no new features, no API changes).

5.0.0 already contains a number of exciting changes, which I describe below, and they are still rolling in with ongoing active development.

Stronger index safety

Many of the 5.0.0 changes are focused on providing stronger protection against index corruption.

All file access now uses Java's NIO.2 APIs, giving us better error handling (e.g., Files.delete returns a meaningful exception) along with atomic rename for safer commits, reducing the risk of hideous "your entire index is gone" bugs like this doozie.

Lucene's replication module, along with distributed servers on top of Lucene such as Elasticsearch or Solr, must copy index files from one place to another. They do this for backup purposes (e.g., snapshot and restore), for migrating or recovering a shard from one node to another or when adding a new replica. Such replicators try to be incremental, so that if the same file name is present, with the same length and checksum, it will not be copied again.

Unfortunately, these layers sometimes have subtle bugs (they are complex!). Thanks to checksums (added in 4.8.0), Lucene already detects if the replicator caused any bit-flips while copying, and this revealed a long standing nasty bug in the compression library Elasticsearch uses.

With 5.0.0 we take this even further and now detect if whole files were copied to the wrong file name, by assigning a unique id to every segment and commit (segments_N file). Each index file now records the segment id in its header, and then these ids are cross-checked when the index is opened.

The new Lucene50Codec also includes further index corruption detection.

Even CorruptIndexException itself is improved! It will now always refer to the file or resource where the corruption was detected, as this is now a required argument to its constructors. When corruption is detected higher up (e.g., a bad field number in the field infos file), the resulting CorruptIndexException will now state whether there was also a checksum mismatch in the file, helping to narrow the possible source of the corruption.

Finally, during merge, IndexWriter now always checks the incoming segments for corruption before merging. This can mean, on upgrading to 5.0.0, that merging may uncover long-standing latent corruption in an older 4.x index.

Reduced heap usage

5.0.0 also includes several changes to reduce heap usage during indexing and searching.

If your index has 1B docs, then caching a single FixedBitSet-based filter in 4.10.2 costs a non-trivial 125 MB of heap! But with 5.0.0, Lucene now supports random-writable and advance-able sparse bitsets (RoaringDocIdSet and SparseFixedBitSet), so the heap required is in proportion to how many bits are set, not how many total documents exist in the index. These bitsets also greatly simplify how MultiTermQuery is rewritten (no more CONSTANT_SCORE_AUTO_REWRITE_METHOD), and they provide faster advance implementations than FixedBitSet's linear scan. Finally, they provide a more accurate cost() implementation, allowing Lucene to make better choices about how to drive the intersection at query time.

Heap usage during IndexWriter merging is also much lower with the new Lucene50Codec, since doc values and norms for the segments being merged are no longer fully loaded into heap for all fields; now they are loaded for the one field currently being merged, and then dropped.

The default norms format now uses sparse encoding when appropriate, so indices that enable norms for many sparse fields will see a large reduction in required heap at search time.

An explain API for heap usage

If you still find Lucene using more heap than you expected, 5.0.0 has a new API to print a tree structure showing a recursive breakdown of which parts are using how much heap. This is analogous to Lucene's explain API, used to understand why a document has a certain relevance score, but applied to heap usage instead.

It produces output like this:

_cz(5.0.0):C8330469: 28MB 
  postings [...]: 5.2MB 
    field 'latitude' [...]: 678.5KB 
      term index [FST(nodes=6679, ...)]: 678.3KB 
This is a much faster way to see what is using up your heap than trying to stare at a Java heap dump.

Further changes

There is a long tail of additional 5.0.0 changes; here are some of them:

  • Old experimental postings formats (Sep/Fixed/VariableIntPostingsFormat) have been removed. PulsingPostingsFormat has also been removed, since the default postings format already pulses unique terms.

  • FieldCache is gone (moved to a dedicated UninvertingReader in the misc module). This means when you intend to sort on a field, you should index that field using doc values, which is much faster and less heap consuming than FieldCache.

  • Tokenizers and Analyzers no longer require Reader on init.

  • NormsFormat now gets its own dedicated NormsConsumer/Producer

  • Simplifications to FieldInfo (Lucene's "low schema"): no more normType (it is always a DocValuesType.NUMERIC), no more isIndexed (just check IndexOptions)

  • Compound file handling is simpler, and is now under codec control.

  • SortedSetSortField, used to sort on a multi-valued field, is promoted from sandbox to Lucene's core

  • PostingsFormat now uses a "pull" API when writing postings, just like doc values. This is powerful because you can do things in your postings format that require making more than one pass through the postings such as iterating over all postings for each term to decide which compression format it should use.

  • Version is no longer required on init to classes like IndexWriterConfig and analysis components.

The changes I've described here are just a snapshot of what we have lined up today for a 5.0.0 release. 5.0.0 is still under active development (patches welcome!) so this list will change by the time the actual release is done.

Saturday, August 30, 2014

Scoring tennis using finite-state automata

For some reason having to do with the medieval French, the scoring system for tennis is very strange.

In actuality, the game is easy to explain: to win, you must score at least 4 points and win by at least 2. Yet in practice, you are supposed to use strange labels like "love" (0 points), "15" (1 point), "30" (2 points), "40" (3 points), "deuce" (3 or more points each, and the players are tied), "all" (players are tied) instead of simply tracking points as numbers, as other sports do.

This is of course wildly confusing to newcomers. Fortunately, the convoluted logic is easy to express as a finite-state automaton (FSA):

The game begins in the left-most (unlabeled) state, and then each time either player 1 (red) or player 2 (blue) scores, you advance to the corresponding state to know how to say the score properly in tennis-speak. In each state, player 1's score is first followed by player 2's; for example "40 30" means player 1 has scored 3 points and player 2 has scored 2 and "15 all" means both players have scored once. "adv 2" means player 2 is ahead by 1 point and will win if s/he scores again.

There are only 20 states, and there are cycles which means a tennis game can in fact go on indefinitely, if the players pass back and forth through the "deuce" (translation: game is tied) state.

This FSA is correct, and if you watch a Wimbledon match, for example, you'll see the game advance through precisely these states.


Yet for an FSA, merely being correct is not good enough!

It should also strive to be minimal, and surprisingly this FSA is not: if you build this Automaton in Lucene and minimize it, you'll discover that there are some wasted states! This means 20 states is overkill when deciding who won the game.

Specifically, there is no difference between the "30 all" and "deuce" states, nor between the "30 40" and "adv 2" states, nor between the "40 30" and "adv 1" states. From either state in each of these pairs, there is no sequence of player 1 / player 2 scoring that will result in a different final outcome (this is in principle how the minimization process identifies indistinguishable states).

Therefore, there's no point in keeping those states, and you can safely use this smaller 17-state FSA (15% smaller!) to score your tennis games instead:

For example, from "15 30", if player 1 scores, you go straight to "deuce" and don't bother with the redundant "30 30" state.

Another (simpler?) way to understand why these states are wasted is to recognize that the finite state machine is tracking two different pieces of information: first, how many points ahead player 1 is (since a player must win by 2 points) and second, how many points have been scored (since a player must score at least 4 points to win).

Once enough points (4 or more) have been scored by either player, their absolute scores no longer matter. All that matters is the relative score: whether player 1 is ahead by 1, equal, or behind by 1. For example, we don't care if the score is 197 to 196 or 6 to 5: they are the same thing.

Yet, early on, the FSA must also track the absolute scores, to ensure at least 4 points were scored by the winner. With the original 20-state FSA, the crossover between these two phases was what would have been "40 40" (each player scored 3 points). But in the minimal machine, the crossover became "30 30" (each player scored 2 points), which is safe since each player must still "win by 2" so if player 1 scores 2 points from "30 30", that means player 1 scored 4 points overall.

FSA minimization saved only 3 states for the game of tennis, resulting in a 15% smaller automaton, and maybe this simplifies keeping track of scores in your games by a bit, but in other FSA applications in Lucene, such as the analyzing suggester, MemoryPostingsFormat and the terms index, minimization is vital since it saves substantial disk and RAM for Lucene applications!

Sunday, August 3, 2014

A new proximity query for Lucene, using automatons

The simplest Apache Lucene query, TermQuery, matches any document that contains the specified term, regardless of where the term occurs inside each document. Using BooleanQuery you can combine multiple TermQuerys, with full control over which terms are optional (SHOULD) and which are required (MUST) or required not to be present (MUST_NOT), but still the matching ignores the relative positions of each term inside the document.

Sometimes you do care about the positions of the terms, and for such cases Lucene has various so-called proximity queries.

The simplest proximity query is PhraseQuery, to match a specific sequence of tokens such as "Barack Obama". Seen as a graph, a PhraseQuery is a simple linear chain:

By default the phrase must precisely match, but if you set a non-zero slop factor, a document can still match even when the tokens are not exactly in sequence, as long as the edit distance is within the specified slop. For example, "Barack Obama" with a slop factor of 1 will also match a document containing "Barack Hussein Obama" or "Barack H. Obama". It looks like this graph:

Now there are multiple paths through the graph, including an any (*) transition to match an arbitrary token. (Note: while the graph cannot properly express it, this query would also match a document that had the tokens Barack and Obama on top of one another, at the same position, which is a little bit strange!)

In general, proximity queries are more costly on both CPU and IO resources, since they must load, decode and visit another dimension (positions) for each potential document hit. That said, for exact (no slop) matches, using common-grams, shingles and ngrams to index additional "proximity terms" in the index can provide enormous performance improvements in some cases, at the expense of an increase in index size.

MultiPhraseQuery is another proximity query. It generalizes PhraseQuery by allowing more than one token at each position, for example:

This matches any document containing either domain name system or domain name service. MultiPhraseQuery also accepts a slop factor to allow for non-precise matches.

Finally, span queries (e.g. SpanNearQuery, SpanFirstQuery) go even further, allowing you to build up a complex compound query based on positions where each clause matched. What makes them unique is that you can arbitrarily nest them. For example, you could first build a SpanNearQuery matching Barack Obama with slop=1, then another one matching George Bush, and then make another SpanNearQuery, containing both of those as sub-clauses, matching if they appear within 10 terms of one another.

Introducing TermAutomatonQuery

As of Lucene 4.10 there will be a new proximity query to further generalize on MultiPhraseQuery and the span queries: it allows you to directly build an arbitrary automaton expressing how the terms must occur in sequence, including any transitions to handle slop. Here's an example:

This is a very expert query, allowing you fine control over exactly what sequence of tokens constitutes a match. You build the automaton state-by-state and transition-by-transition, including explicitly adding any transitions (sorry, no QueryParser support yet, patches welcome!). Once that's done, the query determinizes the automaton and then uses the same infrastructure (e.g. CompiledAutomaton) that queries like FuzzyQuery use for fast term matching, but applied to term positions instead of term bytes. The query is naively scored like a phrase query, which may not be ideal in some cases.

In addition to this new query there is also a simple utility class, TokenStreamToTermAutomatonQuery, that provides loss-less translation of any graph TokenStream into the equivalent TermAutomatonQuery. This is powerful because it means even arbitrary token stream graphs will be correctly represented at search time, preserving the PositionLengthAttribute that some tokenizers now set.

While this means you can finally correctly apply arbitrary token stream graph synonyms at query-time, because the index still does not store PositionLengthAttribute, index-time synonyms are still not fully correct. That said, it would be simple to build a TokenFilter that writes the position length into a payload, and then to extend the new TermAutomatonQuery to read from the payload and apply that length during matching (patches welcome!).

The query is likely quite slow, because it assumes every term is optional; in many cases it would be easy to determine required terms (e.g. Obama in the above example) and optimize such cases. In the case where the query was derived from a token stream, so that it has no cycles and does not use any transitions, it may be faster to enumerate all phrases accepted by the automaton (Lucene already has the getFiniteStrings API to do this for any automaton) and construct a boolean query from those phrase queries. This would match the same set of documents, also correctly preserving PositionLengthAttribute, but would assign different scores.

The code is very new and there are surely some exciting bugs! But it should be a nice start for any application that needs precise control over where terms occur inside documents.

Monday, May 12, 2014

Choosing a fast unique identifier (UUID) for Lucene

Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.

Sometimes your id values are already pre-defined, for example if an external database or content management system assigned one, or if you must use a URI, but if you are free to assign your own ids then what works best for Lucene?

One obvious choice is Java's UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is 4X slower than the fastest. To understand why requires some understanding of how Lucene finds terms.

BlockTree terms dictionary

The purpose of the terms dictionary is to store all unique terms seen during indexing, and map each term to its metadata (docFreq, totalTermFreq, etc.), as well as the postings (documents, offsets, postings and payloads). When a term is requested, the terms dictionary must locate it in the on-disk index and return its metadata.

The default codec uses the BlockTree terms dictionary, which stores all terms for each field in sorted binary order, and assigns the terms into blocks sharing a common prefix. Each block contains between 25 and 48 terms by default. It uses an in-memory prefix-trie index structure (an FST) to quickly map each prefix to the corresponding on-disk block, and on lookup it first checks the index based on the requested term's prefix, and then seeks to the appropriate on-disk block and scans to find the term.

In certain cases, when the terms in a segment have a predictable pattern, the terms index can know that the requested term cannot exist on-disk. This fast-match test can be a sizable performance gain especially when the index is cold (the pages are not cached by the the OS's IO cache) since it avoids a costly disk-seek. As Lucene is segment-based, a single id lookup must visit each segment until it finds a match, so quickly ruling out one or more segments can be a big win. It is also vital to keep your segment counts as low as possible!

Given this, fully random ids (like UUID V4) should perform worst, because they defeat the terms index fast-match test and require a disk seek for every segment. Ids with a predictable per-segment pattern, such as sequentially assigned values, or a timestamp, should perform best as they will maximize the gains from the terms index fast-match test.

Testing Performance

I created a simple performance tester to verify this; the full source code is here. The test first indexes 100 million ids into an index with 7/7/8 segment structure (7 big segments, 7 medium segments, 8 small segments), and then searches for a random subset of 2 million of the IDs, recording the best time of 5 runs. I used Java 1.7.0_55, on Ubuntu 14.04, with a 3.5 GHz Ivy Bridge Core i7 3770K.

Since Lucene's terms are now fully binary as of 4.0, the most compact way to store any value is in binary form where all 256 values of every byte are used. A 128-bit id value then requires 16 bytes.

I tested the following identifier sources: For the UUIDs and Flake IDs I also tested binary encoding in addition to their standard (base 16 or 36) encoding. Note that I only tested lookup speed using one thread, but the results should scale linearly (on sufficiently concurrent hardware) as you add threads.
Zero-padded sequential ids, encoded in binary are fastest, quite a bit faster than non-zero-padded sequential ids. UUID V4 (using Java's UUID.randomUUID()) is ~4X slower.

But for most applications, sequential ids are not practical. The 2nd fastest is UUID V1, encoded in binary. I was surprised this is so much faster than Flake IDs since Flake IDs use the same raw sources of information (time, node id, sequence) but shuffle the bits differently to preserve total ordering. I suspect the problem is the number of common leading digits that must be traversed in a Flake ID before you get to digits that differ across documents, since the high order bits of the 64-bit timestamp come first, whereas UUID V1 places the low order bits of the 64-bit timestamp first. Perhaps the terms index should optimize the case when all terms in one field share a common prefix.

I also separately tested varying the base from 10, 16, 36, 64, 256 and in general for the non-random ids, higher bases are faster. I was pleasantly surprised by this because I expected a base matching the BlockTree block size (25 to 48) would be best.

There are some important caveats to this test (patches welcome)! A real application would obviously be doing much more work than simply looking up ids, and the results may be different as hotspot must compile much more active code. The index is fully hot in my test (plenty of RAM to hold the entire index); for a cold index I would expect the results to be even more stark since avoiding a disk-seek becomes so much more important. In a real application, the ids using timestamps would be more spread apart in time; I could "simulate" this myself by faking the timestamps over a wider range. Perhaps this would close the gap between UUID V1 and Flake IDs? I used only one thread during indexing, but a real application with multiple indexing threads would spread out the ids across multiple segments at once.

I used Lucene's default TieredMergePolicy, but it is possible a smarter merge policy that favored merging segments whose ids were more "similar" might give better results. The test does not do any deletes/updates, which would require more work during lookup since a given id may be in more than one segment if it had been updated (just deleted in all but one of them).

Finally, I used using Lucene's default Codec, but we have nice postings formats optimized for primary-key lookups when you are willing to trade RAM for faster lookups, such as this Google summer-of-code project from last year and MemoryPostingsFormat. Likely these would provide sizable performance gains!