We fixed that, more than 6 years ago now, yielding big indexing throughput gains on concurrent hardware.
Today, hardware has only become even more concurrent, and we've finally done the same thing for processing deleted documents and updating doc values!
This change, in time for Lucene's next major release (7.0), shows a 53% indexing throughput speedup when updating whole documents, and a 7.4X - 8.6X speedup when updating doc values, on a private test corpus using highly concurrent hardware (an i3.16xlarge EC2 instance).
Buffering versus applying
When you ask Lucene's IndexWriter to delete a document, or update a document (which is an atomic delete and then add), or to update a doc-values field for a document, you pass it a Term, typically against a primary key field like id, that identifies which document to update.
But IndexWriter does not perform the deletion right away. Instead, it buffers up all such deletions and updates, and only finally applies them in bulk once they are using too much RAM, or you refresh your near-real-time reader, or call commit, or a merge needs to kick off.
The process of resolving those terms to actual Lucene document ids is quite costly, as Lucene must visit all segments and perform a primary key lookup for each term. Performing lookups in batches gains some efficiency because we sort the terms in Unicode order, so we can do a single sequential scan through each segment's terms dictionary and postings.
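To make the benefit of sorting concrete, here is a minimal sketch of a batched lookup against one segment. It is a toy model, not Lucene's code: the class and method names are hypothetical, a segment is just a sorted array of primary-key terms whose array index stands in for the doc id, and Java's String ordering is used as a stand-in for Lucene's term ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: a "segment" is a sorted array of primary-key terms, and the
// array index stands in for the segment-local doc id. Not Lucene's real code.
final class BatchedTermLookup {

    /** Resolves a batch of delete terms against one segment in a single forward pass. */
    static List<Integer> resolve(String[] sortedSegmentTerms, List<String> deleteTerms) {
        // Sort the batch once so the segment cursor never has to move backwards.
        String[] batch = deleteTerms.toArray(new String[0]);
        Arrays.sort(batch);

        List<Integer> docIds = new ArrayList<>();
        int pos = 0; // current position in the segment's terms
        for (String term : batch) {
            // Advance the segment cursor until it reaches or passes the term.
            while (pos < sortedSegmentTerms.length
                    && sortedSegmentTerms[pos].compareTo(term) < 0) {
                pos++;
            }
            if (pos == sortedSegmentTerms.length) {
                break; // segment exhausted; later terms cannot match either
            }
            if (sortedSegmentTerms[pos].equals(term)) {
                docIds.add(pos);
            }
        }
        return docIds;
    }

    public static void main(String[] args) {
        String[] segment = {"id1", "id3", "id5", "id7", "id9"};
        List<Integer> hits = resolve(segment, List.of("id7", "id2", "id3"));
        System.out.println(hits); // [1, 3]
    }
}
```

Because both sides are sorted, the cursor only ever moves forward, so the whole batch costs one pass over the segment's terms rather than one independent lookup per term.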
We have also optimized primary key lookups and the buffering of deletes and updates quite a bit over time, with issues like LUCENE-6161, LUCENE-2897, LUCENE-2680, LUCENE-3342. Our fast BlockTree terms dictionary can sometimes save a disk seek for each segment if it can tell from the finite state transducer terms index that the requested term cannot possibly exist in this segment.
Still, as fast as we have made this code, only one thread is allowed to run it at a time, and for update-heavy workloads, that one thread can become a major bottleneck. We've seen users asking about this in the past, because while the deletes are being resolved it looks as if IndexWriter is hung since nothing else is happening. The larger your indexing buffer, the longer the hang.
Of course, if you are simply appending new documents to your Lucene index, never updating previously indexed documents, a common use-case these days with the broad adoption of Lucene for log analytics, then none of this matters to you!
Concurrency is hard
With this change, IndexWriter still buffers deletes and updates into packets, but whereas each packet was previously also buffered for later single-threaded application, IndexWriter now immediately resolves the deletes and updates in that packet to the affected documents using the current indexing thread. So you gain as much concurrency as you have indexing threads sending documents through IndexWriter.
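The idea can be sketched with plain java.util.concurrent. This is only an illustration under simplifying assumptions: the "index" is a map from primary key to doc id, a "packet" is a list of keys, pool workers stand in for indexing threads, and every name here is hypothetical rather than taken from IndexWriter.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model only: the "index" maps a primary key to a doc id, and a "packet"
// is a list of primary keys to delete. None of these names come from Lucene.
final class ConcurrentResolveSketch {

    /** Resolves every packet on its own worker instead of funnelling all packets through one thread. */
    static Set<Integer> resolveAll(Map<String, Integer> index,
                                   List<List<String>> packets,
                                   int indexingThreads) {
        Set<Integer> deletedDocs = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(indexingThreads);
        for (List<String> packet : packets) {
            pool.execute(() -> {
                for (String key : packet) {
                    Integer docId = index.get(key); // primary-key lookup
                    if (docId != null) {
                        deletedDocs.add(docId);     // thread-safe set
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("resolution did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while resolving", e);
        }
        return deletedDocs;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = Map.of("a", 0, "b", 1, "c", 2, "d", 3);
        List<List<String>> packets =
                List.of(List.of("a", "x"), List.of("c"), List.of("d", "a"));
        System.out.println(resolveAll(index, packets, 3).size()); // 3
    }
}
```

The sketch sidesteps everything that makes the real change hard (segments being flushed, merged, or reopened while deletes resolve), but it shows why throughput scales with the number of threads doing the lookups.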
The change was overly difficult because of IndexWriter's terribly complex concurrency, a technical debt I am now convinced we need to address head-on by somehow refactoring IndexWriter. This class is challenging to implement since it must handle so many complex and costly concurrent operations: ongoing indexing, deletes and updates; refreshing new readers; writing new segment files; committing changes to disk; merging segments and adding indexes. There are numerous locks, not just IndexWriter's monitor lock but also those of many other internal classes, that make it easy to accidentally trigger a deadlock today. Patches welcome!
The original change also led to some cryptic test failures thanks to our extensive randomized tests, which we are working through for 7.0.
That complex concurrency unfortunately prevented me from making the final step of deletes and updates fully concurrent: writing the new segment files. This file writing takes the in-memory resolved doc ids and writes a new per-segment bitset, for deleted documents, or a whole new doc values column per field, for doc values updates.
This is typically a fast operation, except for large indices where a whole column of doc-values updates could be sizable. But since we must do this for every segment that has affected documents, doing this single-threaded is definitely still a small bottleneck, so it would be nice, once we succeed in simplifying IndexWriter's concurrency, to also make our file writes concurrent.
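For a rough picture of what that per-segment bitset holds, here is a small sketch using java.util.BitSet. It is an assumption-laden illustration: the class name is hypothetical, and Lucene's actual live-docs file format and in-memory representation are different.

```java
import java.util.BitSet;

// Toy model only: Lucene's real live-docs format is different.
final class LiveDocsSketch {

    /** Builds a live-docs bitset for one segment: a set bit means the doc is still live. */
    static BitSet liveDocs(int maxDoc, int[] deletedDocIds) {
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc);          // every doc starts out live
        for (int docId : deletedDocIds) {
            live.clear(docId);        // each resolved delete flips its bit off
        }
        return live;
    }

    public static void main(String[] args) {
        BitSet live = liveDocs(8, new int[] {1, 3});
        byte[] bytes = live.toByteArray(); // the bytes a writer would persist for this segment
        System.out.println(live.cardinality() + " live docs, " + bytes.length + " byte(s)");
    }
}
```

One such bitset has to be rewritten for each segment that contains affected documents, which is why doing the writes on a single thread still costs something.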
[I work at Amazon and the postings on this site are my own and don't necessarily represent Amazon's position]