Changing Bits: 2017

Monday, September 4, 2017

Lucene's near-real-time segment index replication

[TL;DR: Apache Lucene 6.0 quietly introduced a powerful new feature called near-real-time (NRT) segment replication, for efficiently and reliably replicating indices from one server to another, and taking advantage of ever faster and cheaper local area networking technologies. Neither of the popular search servers (Elasticsearch, Solr) are using it yet, but it should bring a big increase in indexing and searching performance and robustness to them.]

Lucene has a unique write-once segmented architecture: recently indexed documents are written to a new self-contained segment, in append-only, write-once fashion: once written, those segment files will never again change. This happens either when too much RAM is being used to hold recently indexed documents, or when you ask Lucene to refresh your searcher so you can search all recently indexed documents.

Over time, smaller segments are merged away into bigger segments, and the index has a logarithmic "staircase" structure of active segment files at any time. This is an unusual design, when compared with databases which continuously update their files in-place, and it bubbles up to all sorts of nice high-level features in Lucene. For example:

Efficient ACID transactions.
Point-in-time view of the index for searching that will never change, even under concurrent indexing, enabling stable user interactions like pagination and drill-down.
Multiple point-in-time snapshots (commit points) can be indefinitely preserved in the index, even under concurrent indexing, useful for taking hot backups of your Lucene index.

The ZFS filesystem's has similar features, such as efficient whole-filesystem snapshots, which are possible because it also uses a write-once design, at the file-block level: when you change a file in your favorite text editor and save it, ZFS allocates new blocks and writes your new version to those blocks, leaving the original blocks unchanged.

This write-once design also empowers Lucene to use optimized data-structures and apply powerful compression techniques when writing index files, because Lucene knows the values will not later change for this segment. For example, you never have to tell Lucene that your doc values field needs only 1 byte of storage, nor that the values are sparse, etc.: Lucene figures that out by itself by looking at all values it is about to write to each new segment.

Finally, and the topic of this post: this design enables Lucene to efficiently replicate a search index from one server ("primary") to another ("replica"): in order to sync recent index changes from primary to replica you only need to look at the file names in the index directory, and not their contents. Any new files must be copied, and any files previously copied do not need to be copied again because they are never changed!

Taking advantage of this, we long ago added a replicator module to Lucene that used exactly this approach, and it works well. However, to use those APIs to replicate recent index changes, each time you would like to sync, you must first commit your changes on the primary index. This is unfortunately a costly operation, invoking fsync on multiple recently written files, and greatly increases the sync latency when compared to a local index opening a new NRT searcher.

Document or segment replication?

The requirement to commit in order to replicate Lucene's recent index changes was such a nasty limitation that when the popular distributed search servers Elasticsearch and Solr added distributed indexing support, they chose not to use Lucene's replication module at all, and instead created their own document replication, where the primary and all replicas redundantly index and merge all incoming documents.

While this might seem like a natural approach to keeping replicas in sync, there are downsides:

More costly CPU/IO resource usage across the cluster: all nodes must do the same redundant indexing and merging work, instead of just one primary node. This is especially painful when large merges are running and interfere with concurrent searching, and is more costly on clusters that need many replicas to support high query rates. This alone would give Elasticsearch and Solr a big increase in cluster wide indexing and searching throughput.
Risk of inconsistency: ensuring that precisely the same set of documents is indexed in primary and all replicas is tricky and contributed to the problems Elasticsearch has had in the past losing documents when the network is mis-behaving. For example, if one of the replicas throws an exception while indexing a document, or if a network hiccup happens, that replica is now missing a document that the other replicas and primary contain.
Costly node recovery after down time: when a replica goes down for a while and then comes back up, it must replay (reindex) any newly indexed documents that arrived while it was down. This can easily be a very large number, requiring a large transaction log and taking a long time to catch up, or it must fallback to a costly full index copy. In contrast, segment based replication only needs to copy the new index files.
High code complexity: the code to handle the numerous possible cases where replicas can become out of sync quickly becomes complex, and handling a primary switch (because the old primary crashed) especially so.
No "point in time" consistency: the primary and replicas refresh on their own schedules, so they are all typically searching slightly different and incomparable views of the index.

Finally, in Lucene 6.0, overshadowed by other important features like dimensional points, we quietly improved Lucene's replication module with support for NRT segment replication, to copy new segment files after a refresh without first having to call commit. This feature is an especially compelling when combined with the neverending trend towards faster and cheaper local area networking technologies.

How does it work?

While the logical design is straightforward ("just copy over the new segment files from primary to replicas"), the implementation is challenging because this adds yet another concurrent operation (replicas slowly copying index files over the wire), along side the many operations that already happen in IndexWriter such as opening new readers, indexing, deleting, segment merging and committing. Fortunately we were able to build on pre-existing Lucene capabilities like NRT readers, to write new segments and identify all their files, and snapshotting, to ensure segment files are not deleted until replicas have finished copying them.

The two main APIs are PrimaryNode, which holds all state for the primary node including a local IndexWriter instance, and ReplicaNode for the replicas. The replicas act like an index writer, since they also create and delete index files, and so they acquire Lucene's index write lock when they start, to detect inadvertent misuse. When you instantiate each replica, you provide a pointer to where its corresponding primary is running, for example a host or IP address and port.

Both the primary and replica nodes expose a SearcherManager so you can acquire and release the latest searcher at any time. Searching on the primary node also works, and would match the behavior of all Elasticsearch and Solr nodes today (since they always do both indexing and searching), but you might also choose to dedicate the primary node to indexing only.

You use IndexWriter from the primary node to make index changes as usual, and then when you would like to search recently indexed documents, you ask the primary node to refresh. Under the hood, Lucene will open a new NRT reader from its local IndexWriter, gather the index files it references, and notify all connected replicas. The replicas then compute which index files they are missing and then copy them over from the primary.

Document deletions, which normally carry in memory as a bitset directly from IndexWriter to an NRT IndexReader, are instead written through to the filesystem and copied over as well. All files are first copied to temporary files, and then renamed (atomically) in the end if all copies were successful. Lucene's existing end-to-end checksums are used to validate no bits were flipped in transit by a flaky network link, or bad RAM or CPU. Finally, the in-memory segments file (a SegmentInfos instance) is serialized on the wire and sent to the replicas, which then deserialize it and open an NRT searcher via the local SearcherManager. The resulting searcher on the replica is guaranteed to search the exact same point-in-time view as the primary.

This all happens concurrently with ongoing searches on the replica, and optionally primary, nodes. Those searches see the old searcher until the replication finishes and the new searcher is opened. You can also use Lucene's existing SearcherLifetimeManager, which tracks each point-in-time searcher using its long version, if you need to keep older searchers around for a while as well.

The replica and primary nodes both expose independent commit APIs; you can choose to call these based on your durability requirements, and even stagger the commits across nodes to reduce cluster wide search capacity impact.

No transaction log

Note that Lucene does not provide a transaction log! More generally, a cluster of primary + replicas linked by NRT replication behave a lot like a single IndexWriter and NRT reader on a single JVM. This means it is your responsibility to be prepared to replay documents for reindexing since the last commit point, if the whole cluster crashes and starts up again, or if the primary node crashes before replicas were able to copy the new point-in-time refresh.

Note that at the filesystem level, there is no difference between the Lucene index for a primary versus replica node, which makes it simple to shut down the old primary and promote one of the replicas to be a new primary.

Merging

With NRT replication, the primary node also does all segment merging. This is important, because merging is a CPU and IO heavy operation, and interferes with ongoing searching.

Once the primary has finished a merge, and before it installs the merged segment in its index, it uses Lucene's merged segment warming API to give all replicas a chance to pre-copy the merged segment. This means that a merge should never block a refresh, so Lucene will keep fast refreshes even as large merged segments are still copying. Once all running replicas have pre-copied the merge, then the primary installs the merged segment, and after the next refresh, so do all the replicas.

We have discussed having replicas perform their own merging, but I suspect that will be a poor tradeoff. Local area networks (e.g. 10 gigabit ethernet) are quickly becoming plenty fast and cheap, and asking a replica to also do merging would necessarily impact search performance. It would also be quite complex trying to ensure replicas perform precisely the same merges as the primary at the same time, and would otherwise break the same point-in-time view across replicas.

Abstractions

The primary and replica nodes are abstract: you must implement certain functions yourself. For example, the low level mechanics of how to copy bytes from primary to replica is up to you. Lucene does not provide that, except in its unit tests which use simple point-to-point TCP "thread per connection" servers. You could choose to use rsync, robocopy, netty servers, a central file server, carrier pigeons, UDP multicast (likely helpful if there are many replicas on the same subnet), etc.

Lucene also does not provide any distributed leader election algorithms to pick a new primary when the current primary has crashed, nor heuristics to detect a downed primary or replica. But once you pick the new primary, Lucene will take care of having all replicas cutover to it, removing stale partially copied files from the old primary, etc.

Finally, Lucene does not provide any load-balancing to direct queries to the least loaded replica, nor any cluster state to keep track of which node is the primary and which are the replicas, though Apache Zookeeper is useful for such shared distributed state. These parts are all up to you!

Expected failure modes

There are many things that can go wrong with a cluster of servers indexing and searching with NRT index replication, and we wrote an evil randomized stress test case to exercise Lucene's handling in such cases. This test creates a cluster by spawning JVM subprocesses for a primary and multiple replica nodes, and begins indexing and replicating, while randomly applying conditions like an unusually slow network to specific nodes, a network that randomly flips bits, random JVM crashes (SIGSEGV!) of either primary or replica nodes, followed by recovery, etc. This test case uncovered all sorts of fun corner cases!

An especially important case is when the primary node goes down (crashes, loses power or is intentionally killed). In this case, one of the replicas, ideally the replica that was furthest along in copying files from the old primary as decided by a distributed election, is promoted to become the new primary node. All other replicas then switch to the new primary, and in the process must delete any files copied or partially copied from the old primary but not referenced by the new primary. Any documents indexed into the primary but not copied to that replica will need to be indexed again, and this is the caller's responsibility (no transaction log).

If the whole cluster crashes or loses power, on restart you need to determine whichever index (primary or replica) has the "most recent" commit point, and start the primary node on that host and replica nodes on the other hosts. Those replica nodes may need to delete some index files in order to switch to the new primary, and Lucene takes care of that. Finally, you will have to replay any documents that arrived after the last successful commit.

Other fun cases include: a replica crashing while it was still pre-copying a merge; a replica that is falling behind because it is still copying files from the previous refresh when a new refresh happens; a replica going down and coming back up later after the primary node has also changed.

Downsides

There are also some minor downsides to segment replication:

Very new code: this feature is quite new and also quite complex, and not widely used yet. Likely there exciting bugs! Patches welcome!
Slightly slower refresh time: after refreshing on the primary, including writing document deletions to disk as well, we must then copy all new files to the replica and open a new searcher on the replica, adding a bit more time before documents are visible for searching on the replica when compared to a straight NRT reader from an IndexWriter. If this is really a problem, you can use the primary node for searching and have refresh latency very close to what a straight NRT reader provides.
Index problems might be replicated: if something goes wrong, and the primary somehow writes a broken index file, then that broken file will be replicated to all replicas too. But this is uncommon these days, especially with Lucene's end to end checksums.

Concluding

NRT segment replication represents an opportunity for sizable performance and reliability improvements to the popular distributed search servers, especially when combined with the ongoing trend towards faster and cheaper local area networks. While this feature unfortunately came too late for Elasticsearch and Solr, I am hopeful that the next popular distributed search server, and its users, can benefit from it!

[I work at Amazon and the postings on this site are my own and don't necessarily represent Amazon's position]

Sunday, July 2, 2017

Lucene gets concurrent deletes and updates!

Long ago, Lucene could only use a single thread to write new segments to disk. The actual indexing of documents, which is the costly process of inverting incoming documents into in-memory segment data structures, could run with multiple threads, but back then, the process of writing those in-memory indices to Lucene segments was single threaded.

We fixed that, more than 6 years ago now, yielding big indexing throughput gains on concurrent hardware.

Today, hardware has only become even more concurrent, and we've finally done the same thing for processing deleted documents and updating doc values!

This change, in time for Lucene's next major release (7.0), shows a 53% indexing throughput speedup when updating whole documents, and a 7.4X - 8.6X speedup when updating doc values, on a private test corpus using highly concurrent hardware (an i3.16xlarge EC2 instance).

Buffering versus applying

When you ask Lucene's IndexWriter to delete a document, or update a document (which is an atomic delete and then add), or to update a doc-values field for a document, you pass it a Term, typically against a primary key field like id, that identifies which document to update. But IndexWriter does not perform the deletion right away. Instead, it buffers up all such deletions and updates, and only finally applies them in bulk once they are using too much RAM, or you refresh your near-real-time reader, or call commit, or a merge needs to kick off.

The process of resolving those terms to actual Lucene document ids is quite costly as Lucene must visit all segments and perform a primary key lookup for each term. Performing lookups in batches gains some efficiency because we sort the terms in unicode order so we can do a single sequential scan through each segment's terms dictionary and postings.

We have also optimized primary key lookups and the buffering of deletes and updates quite a bit over time, with issues like LUCENE-6161, LUCENE-2897, LUCENE-2680, LUCENE-3342. Our fast BlockTree terms dictionary can sometimes save a disk seek for each segment if it can tell from the finite state transducer terms index that the requested term cannot possibly exist in this segment.

Still, as fast as we have made this code, only one thread is allowed to run it at a time, and for update-heavy workloads, that one thread can become a major bottleneck. We've seen users asking about this in the past, because while the deletes are being resolved it looks as if IndexWriter is hung since nothing else is happening. The larger your indexing buffer the longer the hang.

Of course, if you are simply appending new documents to your Lucene index, never updating previously indexed documents, a common use-case these days with the broad adoption of Lucene for log analytics, then none of this matters to you!

Concurrency is hard

With this change, IndexWriter still buffers deletes and updates into packets, but whereas before, when each packet was also buffered for later single-threaded application, instead IndexWriter now immediately resolves the deletes and updates in that packet to the affected documents using the current indexing thread. So you gain as much concurrency as indexing threads you are sending through IndexWriter.

The change was overly difficult because of IndexWriter's terribly complex concurrency, a technical debt I am now convinced we need to address head-on by somehow refactoring IndexWriter. This class is challenging to implement since it must handle so many complex and costly concurrent operations: ongoing indexing, deletes and updates; refreshing new readers; writing new segment files; committing changes to disk; merging segments and adding indexes. There are numerous locks, not just IndexWriter's monitor lock, but also many other internal classes, that make it easy to accidentally trigger a deadlock today. Patches welcome!

The original change also led to some cryptic test failures thanks to our extensive randomized tests, which we are working through for 7.0.

That complex concurrency unfortunately prevented me from making the final step of deletes and updates fully concurent: writing the new segment files. This file writing takes the in-memory resolved doc ids and writes a new per-segment bitset, for deleted documents, or a whole new doc values column per field, for doc values updates.

This is typically a fast operation, except for large indices where a whole column of doc-values updates could be sizable. But since we must do this for every segment that has affected documents, doing this single threaded is definitely still a small bottleneck, so it would be nice, once we succeed in simplifying IndexWriter's concurrency, to also make our file writes concurrent.

[I work at Amazon and the postings on this site are my own and don't necessarily represent Amazon's position]

Tuesday, March 14, 2017

Apache Lucene 7.0 Is Coming Soon!

The Apache Lucene project will likely release its next major release, 7.0, in a few months!

Remember that Lucene developers generally try hard to backport new features for the next non-major (feature) release, and the upcoming 6.5 already has many great changes, so a new major release is exciting because it means the 7.0-only features, which I now describe, are the particularly big ones that we felt could not be backported for 6.5.

Of course, with every major release, we also do more mundane things like remove deprecated 6.x APIs, and drop support for old indices (written with Lucene 5.x or earlier).

This is only a subset of the new 7.0 only features; for the full list please see the 7.0.0 section in the upcoming CHANGES.txt.

Doc values as iterators

The biggest change in 7.0 is changing doc values from a random access API to a more restrictive iterator API.

Doc values are Lucene's column-stride numeric, sorted or binary per-document field storage across all documents. They can be used to hold scoring signals, such as the single-byte (by default) document length encoding or application-dependent signals, or for sorting, faceting or grouping, or even numeric fields that you might use for range filtering in some queries. Their column-stride storage means it's efficient to visit all values for the one field across documents, in contrast to row-stride storage that stored fields use to retrieve all field values for a single document.

Postings have long been consumed through an iterator, so this was a relatively natural change to make, and the two share the same base class, DocIdSetIterator, to step through or seek to each hit.

The initial rote switch to an iterator API was really just a plumbing swap and less interesting than all the subsequent user-impacting improvements that became possible thanks to the more restrictive API:

The 7.0 codec now sparsely encodes sparse doc values and length normalization factors ("norms")
Outlier values no longer consume excessive space
Our doc values based queries take advantage of the new API
Both top-level browse-only facet counts and facet counts for hits in a query are now faster in sparse cases
A new advanceExact method enables more efficient skipping

With these changes, you finally only pay for what you actually use with doc values, in index size, indexing performance, etc. This is the same as other parts of the index like postings, stored fields, term vectors, etc., and it means users with very sparse doc values no longer see merges taking unreasonably long time or the index becoming unexpectedly huge while merging.

Our nightly sparse benchmarks, based on the NYC Trip Data corpus, show the impressive gains each of the above changes (and more!) accomplished.

Goodbye index-time boosts

Index-time boosting, which lets you increase the a-priori score for a particular document versus other documents, is now deprecated and will be removed in 7.0.

This has always been a fragile feature: it was encoded, along with the field's length, into a single byte value, and thus had very low precision. Furthermore, it is now straightforward to write your custom boost into your own doc values field and use function queries to apply the boost at search time. Finally, with index time boosts gone, length encoding is more accurate, and in particular the first nine length values (1 to 9) are distinct.

Query scoring is simpler

BooleanQuery has long exposed a confusing scoring feature called the coordination factor (coord), to reward hits containing a higher percentage of the search terms. However, this hack is only necessary for scoring models like TF/IDF which have "weak" term saturation such that many occurrences of a single term in a document would be more powerful than adding a single occurence of another term from the query. Since this was specific to one scoring model, TFIDFSimilarity, and since Lucene has now switched to the better Okapi BM25 scoring model by default, we have now fully removed coordination factors in 7.0 from both BooleanQuery and Similarity.

Likewise, the query normalization phase of scoring will be removed. This phase tried to equalize scores across different queries and indices so that they are more comparable, but didn't alter the sort order of hits, and was also TF/IDF specific.

With these scoring simplifications BooleanQuery now makes more aggressive query optimizations when the same sub-clause occurs with different Occur constraints, previously not possible since the scores would change.

Classic Query Parser no longer splits on whitespace

Lucene's original, now called "classic", query parser, always pre-splits the incoming query text on whitespace, and then separately sends those single tokens to the query-time analyzer. This means multi-token filters, such as SynonymGraphFilter or ShingleFilter, will not work.

For example, if the user asks for "denial of service attack" and you had a synonym mapping "denial of service" to DOS, the classic query parser would separately analyze "denial", "of" and "service" so your synonym would never match.

We have already added an option to the query parser to not pre-split on whitespace but left the default unchanged for 6.x releases to preserve backwards compatibility. Finally, with 7.0, we fix that default so that analyzers can see multiple tokens at once, and synonyms will work.

More stuff

As of 7.0 Lucene will (finally!) record into the index metadata which Lucene version was used to originally create it. This knowledge can help us implement future backwards compatibility.

Finite state transducers, used in many ways in Lucene, used to have a complex method call pack which would eek out a few more bytes to further shrink the already small size of the FST. But the code was complex and rarely used and sometimes even made the FST larger so we have removed it for 7.0.

IndexWriter, used to add, update and delete documents in your index, will no longer accept broken token offsets sometimes produced by mis-behaving token filters. Offsets are used for highlighting, and broken offsets, where the end offset for a single token comes before the start offset, or the start offset of a token goes backwards versus the previous token, can only break search-time highlighting. So with this change, Lucene prevents such mistakes at index time by throwing an exception. To ease this transition for cases where users didn't even know their analyzer was producing broken offsets, we've also added a few token filters to "correct" offsets before they are passed to IndexWriter.

Advanced users of Lucene often need to cache something custom for each segment at search time, but the APIs for this are trappy and can lead to unexpected memory leaks so we have overhauled these APIs to reduce the chance of accidental misuse.

Finally, the dimensional points API now takes a field name up front to offer per-field points access, matching how the doc values APIs work.

Lucene 7.0 has not been released, so if you have ideas on any additional major-release-worthy changes you'd like to explore please reach out!

[I work at Amazon and the postings on this site are my own and don't necessarily represent Amazon's position]