Thursday, November 10, 2011

SearcherLifetimeManager prevents a broken search user experience



In the past, search indices were usually very static: you built them once, called optimize at the end and shipped them off, and didn't change them very often.

But these days it's just the opposite: most applications have very dynamic indices, constantly being updated with a stream of changes, and you never call optimize anymore.

Lucene's near-real-time search, especially with recent improvements including manager classes to handle the tricky complexities of sharing searchers across threads, offers very fast search turnaround on index changes.

But there is a serious yet often overlooked problem with this approach. To see it, you have to put yourself in the shoes of a user. Imagine Alice comes to your site, runs a search, and is looking through the search results. Not satisfied, after a few seconds she decides to refine that first search. Perhaps she drills down on one of the nice facets you presented, or maybe she clicks to the next page, or picks a different sort criteria (any follow-on action will do). So a new search request is sent back to your server, including the first search plus the requested change (drill down, next page, change sort field, etc.).

How do you handle this follow-on search request? Just pull the latest and greatest searcher from your SearcherManager or NRTManager and search away, right?

Wrong!

If you do this, you risk a broken search experience for Alice, because the new searcher may be different from the original searcher used for Alice's first search request. The differences could be substantial, if you had just opened a new searcher after updating a bunch of documents. This means the results of Alice's follow-on search may have shifted: facet counts are now off, hits are sorted differently so some hits may be duplicated on the second page, or may be lost (if they moved from page 2 to page 1), etc. If you use the new (will be in Lucene 3.5.0) searchAfter API, for efficient paging, the risk is even greater!

Perversely, the frequent searcher reopening that you thought provides such a great user experience by making all search results so fresh, can in fact have just the opposite effect. Each reopen risks breaking all current searches in your application; the more active your site, the more searches you might break!

It's deadly to intentionally break a user's search experience: they will (correctly) conclude your search is buggy, eroding their trust, and then take their business to your competition.

It turns out, this is easy to fix! Instead of pulling the latest searcher for every incoming search request, you should try to pull the same searcher used for the initial search request in the session. This way all follow-on searches see exactly the same index.

Fortunately, there's a new class coming in Lucene 3.5.0, that simplifies this: SearcherLifetimeManager. The class is agnostic to how you obtain the fresh searchers (i.e., SearcherManager, NRTManager, or your own custom source) used for an initial search. Just like Lucene's other manager classes, SearcherLifetimeManager is very easy to use. Create the manager once, up front:
  SearcherLifetimeManager mgr = new SearcherLifetimeManager();
Then, when a search request arrives, if it's an initial (not follow-on) search, obtain the most current searcher in the usual way, but then record this searcher:
  long token = mgr.record(searcher);
The returned token uniquely identifies the specific searcher; you must save it somewhere the user's search results, for example by placing it in a hidden HTML form field.

Later, when the user performs a follow-on search request, make sure the original token is sent back to the server, and then use it to obtain the same searcher:
  // If possible, obtain same searcher version as last
  // search:
  IndexSearcher searcher = mgr.acquire(token);
  if (searcher != null) {
    // Searcher is still here
    try {
      // do searching...
    } finally {
      mgr.release(searcher);
      // Do not use searcher after this!
      searcher = null;
    }
  } else {
    // Searcher was pruned -- notify user session timed
    // out
  }
As long as the original searcher is still available, the manager will return it to you; be sure to release that searcher (ideally in a finally clause).

It's possible searcher is no longer available: for example if Alice ran a new search, but then got hungry, went off to a long lunch, and finally returned then clicked "next page", likely the original searcher will have been pruned!

You should gracefully handle this case, for example by notifying Alice that the search had timed out and asking her to re-submit the original search (which will then get the latest and greatest searcher). Fortunately, you can reduce how often this happens, by controlling how aggressively you prune old searchers:
  mgr.prune(new PruneByAge(600.0));
This removes any searchers older than 10 minutes (you can also implement a custom pruning strategy). You should call it from a separate dedicated thread (not a searcher thread), ideally the same thread that's periodically indexing changes and opening new searchers.

Keeping many searchers around will necessarily tie up resources (open file descriptors, RAM, index files on disk that the IndexWriter would otherwise have deleted). However, because the reopened searchers share sub-readers, the resource consumption will generally be well contained, in proportion to how many index changes occurred between each reopen. Just be sure to use NRTCachingDirectory, to ensure you don't bump up against open file descriptor limits on your operating system (this also gives a good speedup in reopen turnaround time).

Don't erode your users' trust by intentionally breaking their searches!

LUCENE-3486 has the details.

Thursday, November 3, 2011

Near-real-time readers with Lucene's SearcherManager and NRTManager

Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.

But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.

If you have access to the IndexWriter that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice "best of both worlds" blend between the traditional immediate and eventual consistency models.

Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides extremely fast turnaround time on making index changes visible to searchers. Frequent reopens such as every 50 milliseconds, even under relatively high indexing rates, is easily achievable on modern hardware.

Fortunately, it's trivial to use SearcherManager with NRT readers: use the constructor that takes IndexWriter instead of Directory:
  boolean applyAllDeletes = true;
  ExecutorService es = null;
  SearcherManager mgr = new SearcherManager(writer, applyAllDeletes,
                                            new MySearchWarmer(), es);
This tells SearcherManager that its source for new IndexReaders is the provided IndexWriter instance (instead of a Directory instance). After that, use the SearcherManager just as before.

Typically you'll set the applyAllDeletes boolean to true, meaning each reopened reader is required to apply all previous deletion operations (deleteDocuments or updateDocument/s) up until that point.

Sometimes your usage won't require deletions to be applied. For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions. If that's the case, you can pass applyAllDeletes=false instead. This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly. However, if you're using Lucene's trunk (to be eventually released as 4.0), another option is to use MemoryCodec on your id field to greatly reduce the primary-key lookup time.

Note that some or even all of the previous deletes may still be applied even if you pass false. Also, the pending deletes are never lost if you pass false: they remain buffered and will still eventually be applied.

If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two SearcherManagers, one applying deletes and one not.

If you pass a non-null ExecutorService, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request. Most applications do not require this, because the concurrency across multiple searches is sufficient. It's also not clear that this is effective in general as it adds per-segment overhead, and the available concurrency is a function of your index structure. Perversely, a fully optimized index will have no concurrency! Most applications should pass null.



NRTManager

What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches? Use NRTManager!

NRTManager holds onto the IndexWriter instance you provide and then exposes the same APIs for making index changes (addDocument/s, updateDocument/s, deleteDocuments). These methods forward to the underlying IndexWriter, but then return a generation token (a Java long) which you can hold onto after making any given change. The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.

Then, when a given search request requires certain changes to be visible, pass that generation back to NRTManager to obtain a searcher that's guaranteed to reflect all changes for that generation.

Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum. Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using NRTManager.addDocument, saving the returned generation. If she adds multiple posts, just keep the last generation.

Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made. Of course, if your reopen time is fast enough (say once per second), unless Alice types very quickly, any search she runs will already reflect her posts.

But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call NRTManager.waitForGeneration to obtain the SearcherManager to use for searching. If the latest searcher already covers the requested generation, the method returns immediately. Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.

If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it's far less important when Alice's changes become immediately visible to Bob. There's (usually!) no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.

Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index. That immediate search must first wait for the returned generation to become available.

The power of NRTManager is you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model. NRTManager hides all the tricky details of tracking generations.

But: don't abuse this! You may be tempted to always wait for last generation you indexed for all searches, but this would result in very low search throughput on concurrent hardware since all searches would bunch up, waiting for reopens. With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.

Managing reopens is a little trickier with NRTManager, since you should reopen at higher frequency whenever a search is waiting for a specific generation. To address this, there's the useful NRTManagerReopenThread class; use it like this:
  double minStaleSec = 0.025;
  double maxStaleSec = 5.0;
  NRTManagerReopenThread thread = new NRTManagerReopenThread(
                                       nrtManager,
           maxStaleSec,
           minStaleSec);
  thread.start();
  ...
  thread.close();
The minStaleSec sets an upper bound on the time a user must wait before the search can run. This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec.

The maxStaleSec sets a lower bound on how frequently reopens should occur. This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches. Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time. Be sure to eventually call thread.close(), when you are done reopening (for example, on shutting down the application).

You are also free to use your own strategy for calling maybeReopen; you don't have to use NRTManagerReopenThread. Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!