Thursday, November 10, 2011

SearcherLifetimeManager prevents a broken search user experience

In the past, search indices were usually very static: you built them once, called optimize at the end and shipped them off, and didn't change them very often.

But these days it's just the opposite: most applications have very dynamic indices, constantly being updated with a stream of changes, and you never call optimize anymore.

Lucene's near-real-time search, especially with recent improvements including manager classes to handle the tricky complexities of sharing searchers across threads, offers very fast search turnaround on index changes.

But there is a serious yet often overlooked problem with this approach. To see it, you have to put yourself in the shoes of a user. Imagine Alice comes to your site, runs a search, and is looking through the search results. Not satisfied, after a few seconds she decides to refine that first search. Perhaps she drills down on one of the nice facets you presented, or maybe she clicks to the next page, or picks a different sort criteria (any follow-on action will do). So a new search request is sent back to your server, including the first search plus the requested change (drill down, next page, change sort field, etc.).

How do you handle this follow-on search request? Just pull the latest and greatest searcher from your SearcherManager or NRTManager and search away, right?


If you do this, you risk a broken search experience for Alice, because the new searcher may be different from the original searcher used for Alice's first search request. The differences could be substantial if you had just opened a new searcher after updating a bunch of documents. This means the results of Alice's follow-on search may have shifted: facet counts are now off, hits are sorted differently so some hits may be duplicated on the second page, or may be lost (if they moved from page 2 to page 1), etc. If you use the new searchAfter API (coming in Lucene 3.5.0) for efficient paging, the risk is even greater!

Perversely, the frequent searcher reopening that you thought provided such a great user experience, by making all search results so fresh, can in fact have just the opposite effect. Each reopen risks breaking all current searches in your application; the more active your site, the more searches you might break!

It's deadly to intentionally break a user's search experience: they will (correctly) conclude your search is buggy, eroding their trust, and then take their business to your competition.

It turns out, this is easy to fix! Instead of pulling the latest searcher for every incoming search request, you should try to pull the same searcher used for the initial search request in the session. This way all follow-on searches see exactly the same index.

Fortunately, there's a new class coming in Lucene 3.5.0 that simplifies this: SearcherLifetimeManager. The class is agnostic to how you obtain the fresh searchers (i.e., SearcherManager, NRTManager, or your own custom source) used for an initial search. Just like Lucene's other manager classes, SearcherLifetimeManager is very easy to use. Create the manager once, up front:
  SearcherLifetimeManager mgr = new SearcherLifetimeManager();
Then, when a search request arrives, if it's an initial (not follow-on) search, obtain the most current searcher in the usual way, but then record this searcher:
  long token = mgr.record(searcher);
The returned token uniquely identifies the specific searcher; you must save it somewhere alongside the user's search results, for example by placing it in a hidden HTML form field.

Later, when the user performs a follow-on search request, make sure the original token is sent back to the server, and then use it to obtain the same searcher:
  // If possible, obtain same searcher version as last
  // search:
  IndexSearcher searcher = mgr.acquire(token);
  if (searcher != null) {
    // Searcher is still here
    try {
      // do searching...
    } finally {
      mgr.release(searcher);
      // Do not use searcher after this!
      searcher = null;
    }
  } else {
    // Searcher was pruned -- notify user session timed
    // out
  }
As long as the original searcher is still available, the manager will return it to you; be sure to release that searcher (ideally in a finally clause).

It's possible the searcher is no longer available: for example, if Alice ran a new search but then got hungry, went off to a long lunch, and finally returned and clicked "next page", the original searcher will likely have been pruned!

You should gracefully handle this case, for example by notifying Alice that the search had timed out and asking her to re-submit the original search (which will then get the latest and greatest searcher). Fortunately, you can reduce how often this happens, by controlling how aggressively you prune old searchers:
  mgr.prune(new PruneByAge(600.0));
This removes any searchers older than 10 minutes (you can also implement a custom pruning strategy). You should call it from a separate dedicated thread (not a searcher thread), ideally the same thread that's periodically indexing changes and opening new searchers.
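The record/acquire/prune lifecycle can be sketched in plain Python. This is only a conceptual sketch, not Lucene's implementation: the class name, the token scheme, and the internal dict are all made up for illustration (the real class also reference-counts searchers and derives tokens from the reader version).

```python
import time

class SearcherLifetimeManagerSketch:
    """Conceptual sketch of SearcherLifetimeManager: maps a token
    to a recorded searcher plus the time it was recorded."""

    def __init__(self):
        self._searchers = {}  # token -> (record_time, searcher)
        self._next_token = 0

    def record(self, searcher):
        # Hand back a token that the application stores with the
        # user's search results (e.g. in a hidden form field):
        token = self._next_token
        self._next_token += 1
        self._searchers[token] = (time.time(), searcher)
        return token

    def acquire(self, token):
        # Return the same searcher recorded earlier, or None if
        # it has since been pruned:
        entry = self._searchers.get(token)
        return entry[1] if entry is not None else None

    def prune_by_age(self, max_age_sec):
        # Drop searchers recorded more than max_age_sec ago; call
        # this from a dedicated (non-search) thread:
        now = time.time()
        for token, (recorded, _) in list(self._searchers.items()):
            if now - recorded > max_age_sec:
                del self._searchers[token]
```

A follow-on request simply calls acquire with the stored token and falls back to a fresh searcher (plus a "session timed out" notice) when it returns None.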

Keeping many searchers around will necessarily tie up resources (open file descriptors, RAM, index files on disk that the IndexWriter would otherwise have deleted). However, because the reopened searchers share sub-readers, the resource consumption will generally be well contained, in proportion to how many index changes occurred between each reopen. Just be sure to use NRTCachingDirectory, to ensure you don't bump up against open file descriptor limits on your operating system (this also gives a good speedup in reopen turnaround time).

Don't erode your users' trust by intentionally breaking their searches!

LUCENE-3486 has the details.

Thursday, November 3, 2011

Near-real-time readers with Lucene's SearcherManager and NRTManager

Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.

But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.

If you have access to the IndexWriter that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice "best of both worlds" blend between the traditional immediate and eventual consistency models.

Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides extremely fast turnaround time on making index changes visible to searchers. Frequent reopens, such as every 50 milliseconds, even under relatively high indexing rates, are easily achievable on modern hardware.

Fortunately, it's trivial to use SearcherManager with NRT readers: use the constructor that takes IndexWriter instead of Directory:
  boolean applyAllDeletes = true;
  ExecutorService es = null;
  SearcherManager mgr = new SearcherManager(writer, applyAllDeletes,
                                            new MySearchWarmer(), es);
This tells SearcherManager that its source for new IndexReaders is the provided IndexWriter instance (instead of a Directory instance). After that, use the SearcherManager just as before.

Typically you'll set the applyAllDeletes boolean to true, meaning each reopened reader is required to apply all previous deletion operations (deleteDocuments or updateDocument/s) up until that point.

Sometimes your usage won't require deletions to be applied. For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions. If that's the case, you can pass applyAllDeletes=false instead. This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly. However, if you're using Lucene's trunk (to be eventually released as 4.0), another option is to use MemoryCodec on your id field to greatly reduce the primary-key lookup time.

Note that some or even all of the previous deletes may still be applied even if you pass false. Also, the pending deletes are never lost if you pass false: they remain buffered and will still eventually be applied.

If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two SearcherManagers, one applying deletes and one not.

If you pass a non-null ExecutorService, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request. Most applications do not require this, because the concurrency across multiple searches is sufficient. It's also not clear that this is effective in general as it adds per-segment overhead, and the available concurrency is a function of your index structure. Perversely, a fully optimized index will have no concurrency! Most applications should pass null.


What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches? Use NRTManager!

NRTManager holds onto the IndexWriter instance you provide and then exposes the same APIs for making index changes (addDocument/s, updateDocument/s, deleteDocuments). These methods forward to the underlying IndexWriter, but then return a generation token (a Java long) which you can hold onto after making any given change. The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.

Then, when a given search request requires certain changes to be visible, pass that generation back to NRTManager to obtain a searcher that's guaranteed to reflect all changes for that generation.
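The generation bookkeeping can be sketched with a condition variable. This is a simplified stand-in for what NRTManager does internally, under assumed names (GenerationTracker and its methods are not Lucene's API):

```python
import threading

class GenerationTracker:
    """Sketch of NRTManager's generation idea: each indexing op
    returns an increasing generation token; a search that needs a
    given generation blocks until a reopen makes it visible."""

    def __init__(self):
        self._cond = threading.Condition()
        self._indexed_gen = 0     # last generation handed to an indexer
        self._searchable_gen = 0  # last generation visible to searches

    def add_document(self, doc):
        with self._cond:
            self._indexed_gen += 1
            return self._indexed_gen  # caller saves this token

    def reopen(self):
        # Called after a new searcher is opened: everything indexed
        # so far is now searchable, so wake any waiters.
        with self._cond:
            self._searchable_gen = self._indexed_gen
            self._cond.notify_all()

    def wait_for_generation(self, gen):
        # Returns once a searcher covering `gen` is visible:
        with self._cond:
            while self._searchable_gen < gen:
                self._cond.wait()
```

Alice's request calls wait_for_generation with the token saved from her last post; Bob's request skips the wait entirely and just uses the current searcher.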

Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum. Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using NRTManager.addDocument, saving the returned generation. If she adds multiple posts, just keep the last generation.

Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made. Of course, if your reopen time is fast enough (say once per second), unless Alice types very quickly, any search she runs will already reflect her posts.

But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call NRTManager.waitForGeneration to obtain the SearcherManager to use for searching. If the latest searcher already covers the requested generation, the method returns immediately. Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.

If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it matters far less exactly when Alice's changes become visible to Bob. There's (usually!) no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.

Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index. That immediate search must first wait for the returned generation to become available.

The power of NRTManager is you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model. NRTManager hides all the tricky details of tracking generations.

But: don't abuse this! You may be tempted to always wait for the last generation you indexed for all searches, but this would result in very low search throughput on concurrent hardware since all searches would bunch up, waiting for reopens. With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.

Managing reopens is a little trickier with NRTManager, since you should reopen at higher frequency whenever a search is waiting for a specific generation. To address this, there's the useful NRTManagerReopenThread class; use it like this:
  double minStaleSec = 0.025;
  double maxStaleSec = 5.0;
  NRTManagerReopenThread thread = new NRTManagerReopenThread(
                                        nrtManager,
                                        maxStaleSec,
                                        minStaleSec);
  thread.start();
The minStaleSec sets an upper bound on the time a user must wait before the search can run. This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec.

The maxStaleSec sets a lower bound on how frequently reopens should occur. This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches. Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time. Be sure to eventually call thread.close(), when you are done reopening (for example, on shutting down the application).
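The scheduling policy behind those two parameters can be sketched in Python. Everything here is a hypothetical stand-in, not Lucene's API: the manager object and its closed/maybe_reopen/has_waiting_searches members are invented for illustration.

```python
import time

def reopen_loop(manager, min_stale_sec=0.025, max_stale_sec=5.0):
    """Sketch of NRTManagerReopenThread's policy: reopen at the
    fast min_stale_sec cadence while any search is waiting on a
    generation, otherwise at the relaxed max_stale_sec cadence."""
    while not manager.closed:
        start = time.time()
        manager.maybe_reopen()
        # Pick the next target staleness based on whether any
        # search is currently blocked in waitForGeneration:
        stale = min_stale_sec if manager.has_waiting_searches() else max_stale_sec
        delay = stale - (time.time() - start)
        if delay > 0:
            time.sleep(delay)
```

So a waiting search is stale by at most roughly min_stale_sec, while ordinary searches see changes within roughly max_stale_sec, matching the bounds described above.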

You are also free to use your own strategy for calling maybeReopen; you don't have to use NRTManagerReopenThread. Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!

Tuesday, October 25, 2011

Accuracy and performance of Google's Compact Language Detector

To get a sense of the accuracy and performance of Google's Compact Language Detector, I ran some tests against two other packages: the language-detection library and Apache Tika's language identifier.

For the test corpus I used the corpus described here, created by the author of language-detection. It contains 1000 texts from each of 21 languages, randomly sampled from the Europarl corpus.

It's not a perfect test (no test ever is!): the content is already very clean plain text; there are no domain, language, or encoding hints to apply (which you'd normally have with HTML content loaded over HTTP); it "only" covers 21 languages (versus at least 76 that CLD can detect).

CLD and language-detection cover all 21 languages, but Tika is missing Bulgarian (bg), Czech (cs), Lithuanian (lt) and Latvian (lv), so I only tested on the remaining subset of 17 languages that all three detectors support. This works out to 17,000 texts totalling 2.8 MB.

Many of the texts are very short, making the test challenging: the shortest is 25 bytes, and 290 (1.7%) of the 17000 are 30 bytes or less.

In addition to the challenges of the corpora, the differences in the detectors make the comparison somewhat apples to oranges. For example, CLD detects at least 76 languages, while language-detection detects 53 and Tika detects 27, so this biases against CLD, and language-detection to a lesser extent, since their classification task is harder relative to Tika's.

For CLD, I disabled its option to abstain (removeWeakMatches), so that it always guesses at the language even when confidence is low, to match the other two detectors. I also turned off the pickSummaryLanguage, as this was also hurting accuracy; now CLD simply picks the highest scoring match as the detected language.

For language-detection, I ran with the default ALPHA of 0.5, and set the random seed to 0.

Here are the raw results:

CLD results (total 98.82% = 16800 / 17000):
     da  93.4%   da=934  nb=54  sv=5  fr=2  eu=2  is=1  hr=1  en=1  
     de  99.6%   de=996  en=2  ga=1  cy=1          
     el  100.0%   el=1000                
     en  100.0%   en=1000                
     es  98.3%   es=983  pt=4  gl=3  en=3  it=2  eu=2  id=1  fi=1  da=1
     et  99.6%   et=996  ro=1  id=1  fi=1  en=1        
     fi  100.0%   fi=1000                
     fr  99.2%   fr=992  en=4  sq=2  de=1  ca=1        
     hu  99.9%   hu=999  it=1              
     it  99.5%   it=995  ro=1  mt=1  id=1  fr=1  eu=1      
     nl  99.5%   nl=995  af=3  sv=1  et=1          
     pl  99.6%   pl=996  tr=1  sw=1  nb=1  en=1        
     pt  98.7%   pt=987  gl=4  es=3  mt=1  it=1  is=1  ht=1  fi=1  en=1
     ro  99.8%   ro=998  da=1  ca=1            
     sk  98.8%   sk=988  cs=9  en=2  de=1          
     sl  95.1%   sl=951  hr=32  sr=8  sk=5  en=2  id=1  cs=1    
     sv  99.0%   sv=990  nb=9  en=1            

Tika results (total 97.12% = 16510 / 17000):
     da  87.6%   da=876  no=112  nl=4  sv=3  it=1  fr=1  et=1  en=1  de=1        
     de  98.5%   de=985  nl=3  it=3  da=3  sv=2  fr=2  sl=1  ca=1          
     el  100.0%   el=1000                        
     en  96.9%   en=969  no=10  it=6  ro=4  sk=3  fr=3  hu=2  et=2  sv=1        
     es  89.8%   es=898  gl=47  pt=22  ca=15  it=6  eo=4  fr=3  fi=2  sk=1  nl=1  et=1    
     et  99.1%   et=991  fi=4  fr=2  sl=1  no=1  ca=1              
     fi  99.4%   fi=994  et=5  hu=1                    
     fr  98.0%   fr=980  sl=6  eo=3  et=2  sk=1  ro=1  no=1  it=1  gl=1  fi=1  es=1  de=1  ca=1
     hu  99.9%   hu=999  ca=1                      
     it  99.4%   it=994  eo=4  pt=1  fr=1                  
     nl  97.8%   nl=978  no=8  de=3  da=3  sl=2  ro=2  pl=1  it=1  gl=1  et=1      
     pl  99.1%   pl=991  sl=3  sk=2  ro=1  it=1  hu=1  fi=1            
     pt  94.4%   pt=944  gl=48  hu=2  ca=2  it=1  et=1  es=1  en=1          
     ro  99.3%   ro=993  is=2  sl=1  pl=1  it=1  hu=1  fr=1            
     sk  96.2%   sk=962  sl=21  pl=13  it=2  ro=1  et=1              
     sl  98.5%   sl=985  sk=7  et=4  it=2  pt=1  no=1              
     sv  97.1%   sv=971  no=15  nl=6  da=6  de=1  ca=1              

Language-detection results (total 99.22% = 16868 / 17000):
     da  97.1%   da=971  no=28  en=1      
     de  99.8%   de=998  da=1  af=1      
     el  100.0%   el=1000          
     en  99.7%   en=997  nl=1  fr=1  af=1    
     es  99.5%   es=995  pt=4  en=1      
     et  99.6%   et=996  fi=2  de=1  af=1    
     fi  99.8%   fi=998  et=2        
     fr  99.8%   fr=998  sv=1  it=1      
     hu  99.9%   hu=999  id=1        
     it  99.8%   it=998  es=2        
     nl  97.7%   nl=977  af=21  sv=1  de=1    
     pl  99.9%   pl=999  nl=1        
     pt  99.4%   pt=994  es=3  it=1  hu=1  en=1  
     ro  99.9%   ro=999  fr=1        
     sk  98.7%   sk=987  cs=8  sl=2  ro=1  lt=1  et=1
     sl  97.2%   sl=972  hr=27  en=1      
     sv  99.0%   sv=990  no=8  da=2      

Some quick analysis:
  • The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the tests are!

  • The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and language-detection). Tika in particular has trouble with Spanish (confuses it with Galician). These confusions are to be expected: the languages are very similar.

When language-detection was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time. These numbers are quite low! It tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it's not the case that they are all always wrong on the short texts.

This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts). This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.

You could also make a simple majority-rules voting system across these (and other) libraries. I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with language-detection choice. This gives the best accuracy of all: total 99.59% (= 16930 / 17000)!
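That voting rule is simple enough to state as code; this is just a sketch of the combination logic described above, not any of the three libraries' APIs:

```python
from collections import Counter

def vote(cld_lang, tika_lang, langdetect_lang):
    """Majority vote across the three detectors; when all three
    disagree, fall back to language-detection's answer, since it
    has the best standalone accuracy."""
    counts = Counter([cld_lang, tika_lang, langdetect_lang])
    lang, n = counts.most_common(1)[0]
    return lang if n >= 2 else langdetect_lang
```

Running this per test document is what produced the 99.59% combined accuracy above.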

Finally, I also separately tested the run time for each package. Each time is the best of 10 runs through the full corpus:

  CLD                   171 msec   16.331 MB/sec
  language-detection   2367 msec    1.180 MB/sec
  Tika                42219 msec    0.066 MB/sec

CLD is incredibly fast! language-detection is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).

I used the 09-13-2011 release of language-detection, the current trunk (svn revision 1187915) of Apache Tika, and the current trunk (hg revision b0adee43f3b1) of CLD. All sources for the performance tests are available from here.

Monday, October 24, 2011

Additions to Compact Language Detector API

I've made some small improvements after my quick initial port of Google's Compact Language Detection Library, starting with some helpful Python constants:

  • cld.ENCODINGS has all the encoding names recognized by CLD; if you pass the encoding hint it must be one of these.

  • cld.LANGUAGES has the list of all base languages known (but not necessarily detectable) by CLD.

  • cld.EXTERNAL_LANGUAGES has the list of external languages known (but not necessarily detectable) by CLD.

  • cld.DETECTED_LANGUAGES has the list of detectable languages.

I haven't found a reliable way to get the full list of detectable languages; for now, I've started with all languages that are covered by the unit test, a total count of 75, which should be a lower bound on the true count.

I also exposed control over whether CLD should abstain from a given matched language if the confidence is too low, by adding a parameter removeWeakMatches (required in C and optional in Python, default False).  Turn this option on if abstaining is OK in your use case, such as a browser toolbar offering to translate content.  Turn it off when testing accuracy vs other language detection libraries (unless they also abstain!).

CLD also has an algorithm that tries to pick the best "summary" language, and it doesn't always just pick the highest scoring match. For example, the code has this comment:
    // If English and X, where X (not UNK) is big enough,
    // assume the English is boilerplate and return X.
See the CalcSummaryLanguage function for more details!

I found this was hurting accuracy in testing, so I added a parameter pickSummaryLanguage (default False) to turn this behavior on or off.

Finally, I fixed the Python binding to release the GIL while CLD is running, so multiple threads can now detect without falsely blocking one another.

Friday, October 21, 2011

Language detection with Google's Compact Language Detector

Google's Chrome browser has a useful translate feature, where it detects the language of the page you've visited and if it differs from your local language, it offers to translate it.

Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content.   It looks like CLD was extracted from the language detection library used in Google's toolbar.

It turns out the CLD part of the Chromium source tree is nicely standalone, so I pulled it out into a new separate Google code project, making it possible to use CLD directly from any C++ code.

I also added a basic initial Python binding (one method!), and ported the small C++ unit test (verifying detection of known strings for 64 different languages) to Python (it passes!).

So detecting language is now very simple from Python:
    import cld
    topLanguageName = cld.detect(bytes)[0]
The detect method returns a tuple, including the language name and code (such as RUSSIAN, ru), an isReliable boolean (True if CLD is quite sure of itself), the number of actual text bytes processed, and then details for each of the top languages (up to 3) that were identified.

You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out before-hand.

You can also optionally provide hints to the detect method, including the declared encoding and language (for example, from an HTTP header or an embedded META http-equiv tag in the HTML), as well as the domain name suffix (so the top level domain suffix es would boost the chances for detecting Spanish). CLD uses these hints to boost the priors for certain languages. There is this fun comment in the code in front of the tables holding the per-language prior boosts:
    Generated by dsites 2008.07.07 from 10% of Base
How I wish I too could build tables off of 10% of Base!

The code itself looks very cool and I suspect (but haven't formally verified!) it's quite accurate. I only understand bits and pieces about how it works; you can read some details here and here.

It's also not clear just how many languages it can detect; I see there are 161 "base" languages plus 44 "extended" languages, but then I see many test cases (102 out of 166!) commented out.  This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.

This port is all still very new, and I extracted CLD quickly, so likely there are some problems still to work out, but the fact that it passes the Python unit test is encouraging.  The README.txt has some more details.

Thank you Google!

Monday, September 26, 2011

Lucene's SearcherManager simplifies reopen with threads

Modern computers have wonderful hardware concurrency, within and across CPU cores, RAM and IO resources, which means your typical server-based search application should use multiple threads to fully utilize all resources.

For searching, this usually means you'll have one thread handle each search request, sharing a single IndexSearcher instance. This model is effective: the Lucene developers work hard to minimize internal locking in all Lucene classes. In fact, we recently removed thread contention during indexing (specifically, flushing), resulting in massive gains in indexing throughput on highly concurrent hardware.

Since IndexSearcher exposes a fixed, point-in-time view of the index, when you make changes to the index you'll need to reopen it. Fortunately, since version 2.9, Lucene has provided the IndexReader.reopen method to get a new reader reflecting the changes.

This operation is efficient: the new reader shares already warmed sub-readers in common with the old reader, so it only opens sub-readers for any newly created segments. This means reopen time is generally proportional to how many changes you made; however, when a large merge has completed, it will take longer. It's best to warm the new reader before putting it into production by running a set of "typical" searches for your application, so that Lucene performs one-time initialization of internal data structures (norms, field cache, etc.).

But how should you properly reopen, while search threads are still running and new searches are forever arriving? Your search application is popular, users are always searching and there's never a good time to switch! The core issue is that you must never close your old IndexReader while other threads are still using it for searching, otherwise those threads can easily hit cryptic exceptions that often mimic index corruption.

Lucene tries to detect that you've done this, and will throw a nice AlreadyClosedException, but we cannot guarantee that exception is thrown since we only check up front, when the search kicks off: if you close the reader when a search is already underway then all bets are off.

One simple approach would be to temporarily block all new searches and wait for all running searches to complete, and then close the old reader and switch to the new one. This is how janitors often clean a bathroom: they wait for all current users to finish and block new users with the all-too-familiar plastic yellow sign.

While the bathroom cleaning approach will work, it has an obviously serious drawback: during the cutover you are now forcing your users to wait, and that wait time could be long (the time for the slowest currently running search to finish).

A much better solution is to immediately direct new searches to the new reader, as soon as it's done warming, and then separately wait for the still-running searches against the old reader to complete. Once the very last search has finished with the old reader, close it.

This solution is fully concurrent: it has no locking whatsoever so searches are never blocked, as long as you use a separate thread to perform the reopen and warming. The time to reopen and warm the new reader has no impact on ongoing searches, except to the extent that reopen consumes CPU, RAM and IO resources to do its job (and, sometimes, this can in fact interfere with ongoing searches).

So how exactly do you implement this approach? The simplest way is to use the reference counting APIs already provided by IndexReader to track how many threads are currently using each searcher. Fortunately, as of Lucene 3.5.0, there will be a new contrib/misc utility class, SearcherManager, originally created as an example for Lucene in Action, 2nd edition, that does this for you! (LUCENE-3445 has the details.)
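Here is a rough Python sketch of that reference-counting idea. The class and method names are illustrative only (in Lucene the counting lives in IndexReader's incRef/decRef, hidden behind SearcherManager):

```python
import threading

class RefCountedSearcher:
    """A searcher that closes itself only when the manager has
    swapped it out AND the last in-flight search has released it."""

    def __init__(self, name):
        self.name = name
        self.closed = False
        self._refs = 1  # the manager's own reference
        self._lock = threading.Lock()

    def inc_ref(self):
        with self._lock:
            assert not self.closed
            self._refs += 1

    def dec_ref(self):
        with self._lock:
            self._refs -= 1
            if self._refs == 0:
                self.closed = True  # stand-in for reader.close()

class SearcherManagerSketch:
    def __init__(self, searcher):
        self._lock = threading.Lock()
        self._current = searcher

    def acquire(self):
        with self._lock:
            self._current.inc_ref()
            return self._current

    def release(self, searcher):
        searcher.dec_ref()

    def swap(self, new_searcher):
        # New searches immediately see new_searcher; the old one
        # closes only when its last in-flight search releases it:
        with self._lock:
            old, self._current = self._current, new_searcher
        old.dec_ref()  # drop the manager's reference
```

Note how swap never waits for running searches: cutover is immediate, and the old reader lingers only as long as someone still holds a reference.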

The class is easy to use. You first create it, by providing the Directory holding your index and a SearchWarmer instance:

  class MySearchWarmer implements SearchWarmer {
    public void warm(IndexSearcher searcher) throws IOException {
      // Run some diverse searches, searching and sorting against all
      // fields that are used by your application
    }
  }

  Directory dir = FSDirectory.open(new File("/path/to/index"));
  SearcherManager mgr = new SearcherManager(dir,
                                            new MySearchWarmer());

Then, for each search request:

  IndexSearcher searcher = mgr.acquire();
  try {
    // Do your search, including loading any documents, etc.
  } finally {
    mgr.release(searcher);

    // Set to null to ensure we never again try to use
    // this searcher instance after releasing:
    searcher = null;
  }

Be sure you fully consume searcher before releasing it! A common mistake is to release it yet later accidentally use it again to load stored documents, for rendering the search results for the current page.

Finally, you'll need to periodically call the maybeReopen method from a separate (i.e., non-searching) thread. This method will reopen the reader, and only if there was actually a change will it cut over. If your application knows when changes have been committed to the index, you can reopen right after that. Otherwise, you can simply call maybeReopen every X seconds. When there has been no change to the index, the cost of maybeReopen is negligible, so calling it frequently is fine.

Beware the potentially high transient cost of reopen and warm! During reopen, as you must have two readers open until the old one can be closed, you should budget plenty of RAM in the computer and heap for the JVM, to comfortably handle the worst case when the two readers share no sub-readers (for example, after a full optimize) and thus consume 2X the RAM of a single reader. Otherwise you might hit a swap storm or OutOfMemoryError, effectively taking down your entire search application. Worse, you won't see this problem early on: your first few hundred reopens could easily use only small amounts of added heap, but then suddenly on some unexpected reopen the cost is far higher. Reopening and warming is also generally IO intensive as the reader must load certain index data structures into memory.

Next time I'll describe another utility class, NRTManager, available since version 3.3.0, that you should use instead if your application uses Lucene's fast-turnaround near-real-time (NRT) search. This class solves the same problem (thread-safety during reopening) as SearcherManager but adds a fun twist as it gives you more specific control over which changes must be visible in the newly opened reader.

Thursday, June 30, 2011

Primary key lookups are 2.8X faster with MemoryCodec

A few days ago I committed the new MemoryCodec to Lucene's trunk (to be 4.0). This codec indexes all terms and postings into a compact finite-state transducer (FST) and then, at search time, avoids I/O by performing all terms and postings enumerations in memory using the FST.

If your application needs fast primary-key lookups, and you can afford the required additional memory, this codec might be a good match for the id field. To test this, I switched Lucene's nightly benchmark to use MemoryCodec (just for its id field), and performance jumped from around 179 K to 509 K lookups per second:

This is an awesome improvement! It's particularly impressive as the id field was previously indexed using PulsingCodec, which was already faster than the default StandardCodec.

This is the performance for a single thread, and should scale up linearly if you use multiple threads. Each lookup resolves a batch of 4,000 keys at once, in sorted order, from the id field, performing the lookups segment by segment for best performance (see the source code). The index has 27.6 M docs across multiple segments.

Of course, there is added memory required, specifically 188 MB for this index, which works out to 7.1 bytes per document on average.

There are two sources of MemoryCodec's gains. First, the obvious one: since everything is in memory, you never wait for an I/O seek operation, as long as you ensure the sneaky OS never swaps out your process memory.

Second, I separately added a new seekExact API to TermsEnum, enabling codecs to save CPU if the caller does not need to know the following term when the target term doesn't exist, as is the case here. MemoryCodec has an optimized implementation for seekExact (and so does the cool SimpleTextCodec!). Eventually other codecs should as well, by using the block tree terms index, but we're not there yet.

The id field in the nightly benchmark omits term freq and positions; however, MemoryCodec is fully general: you can use it for any field (not just primary-key fields), storing positions, payloads, etc. Also, the field's values are zero-padded sequential integers (00000001, 00000002, 00000003, etc.), which is likely important for performance as it allows maximal sharing in the FST. I haven't tested this, but I suspect that had I used something more random, such as GUIDs, memory usage would be higher and lookup performance worse, as each segment's FST would be less dense (share less).
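Generating such keys is trivial; this hypothetical helper zero-pads a sequential counter to a fixed width so that consecutive ids share long prefixes, which is what lets the FST share structure across terms:

```java
public class PaddedIds {
    // Zero-pad a sequential counter to a fixed width so consecutive
    // keys share long prefixes, maximizing sharing inside the FST:
    static String idFor(int n) {
        return String.format("%08d", n);
    }

    public static void main(String[] args) {
        System.out.println(idFor(1));    // 00000001
        System.out.println(idFor(42));   // 00000042
    }
}
```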

Of course, Lucene is not a database, and you normally use it for its fast search performance, not primary-key lookups. The one common search use case where you do require primary-key lookups is during indexing, when deleting or updating documents by an id field. Near-realtime search with updates or deletions relies on this, since the deleted documents must be resolved during reopen, so we also see a healthy speedup in the NRT reopen time:

The NRT latency dropped from around 52 milliseconds to 43 milliseconds, a 17% improvement. This is "only" 17% because opening a new reader must also do other things, like flushing the indexed documents as a new segment.

Perhaps more importantly, the variance also dropped substantially, which is expected because with MemoryCodec and NRTCachingDirectory, NRT reopen is fully I/O free (performs no reads or writes when opening a new reader).

One limitation of MemoryCodec is it's an all-or-nothing deal: all terms and postings are in memory, or they aren't. LUCENE-3069, still to be done (any volunteers?), aims to fix this, by enabling you to separately choose whether terms and/or postings data should be in memory.

I suspect an even more specialized codec, for example one that requires the field values to be compact integers, and also requires that the values are unique (only supports primary-key fields), could do even better than MemoryCodec by storing the mapping in global (across all segments) parallel arrays. Such a codec would no longer be general; it'd only work for primary-key fields whose values are compact integers. But it'd have faster lookups than MemoryCodec and should use less memory per document. This codec could simply wrap any other codec, i.e. it would create the arrays on reader initialization, and delegate persisting the postings into the index to the wrapped codec.

Tuesday, June 14, 2011

Near-real-time latency during large merges

I looked into the curious issue I described in my last post, where the NRT reopen delays can become "spikey" (take longer) during a large merge.

To show the issue, I modified the NRT test to kick off a background optimize on startup. This runs a single large merge, creating a 13 GB segment, and indeed produces spikey reopen delays (purple):

The large merge finishes shortly after 7 minutes, after which the reopen delays become healthy again. Search performance (green) is unaffected.

I also added Linux's dirty bytes to the graph, as reported by /proc/meminfo; it's the saw-tooth blue/green series on the bottom. Note that it's divided by 10, to better fit the Y axis; the peaks are around 800-900 MB.

The large merge writes bytes at a fairly high rate (around 30 MB/sec), but Linux buffers those writes in RAM, only actually flushing them to disk every 30 seconds; this is what produces the saw-tooth pattern.

From the graph you can see that the spikey reopen delays generally correlate to when Linux is flushing the dirty pages to disk. Apparently, this heavy write IO interferes with the read IO required when resolving deleted terms to document IDs. To confirm this, I ran the same stress test, but with only adds (no deletions); the reopen delays were then unaffected by the ongoing large merge.

So finally the mystery is explained, but, how to fix it?

I know I could tune Linux's IO, for example to write more frequently, but I'd rather find a Lucene-only solution since we can't expect most users to tune the OS.

One possibility is to make a RAM resident terms dictionary, just for primary-key fields. This could be very compact, for example by using an FST, and should give lookups that never hit disk unless the OS has frustratingly swapped out your RAM data structures. This can also be separately useful for applications that need fast document lookup by primary key, so someone should at some point build this.

Another, lower-level idea is to simply rate-limit the bytes/sec written by merges. Since big merges also impact ongoing searches, we could likely help that case as well. To try this out, I made a simple prototype (see LUCENE-3202), and then re-ran the same stress test, limiting all merging to 10 MB/sec:

The optimize now took 3 times longer, and the peak dirty bytes (around 300 MB) is 1/3rd as large, as expected since the IO write rate is limited to 10 MB/sec. But look at the reopen delays: they are now much better contained, averaging around 70 milliseconds while the optimize is running, and dropping to 60 milliseconds once the optimize finishes. I think the ability to limit merging IO is an important feature for Lucene!

Tuesday, June 7, 2011

Lucene's near-real-time search is fast!

Lucene's near-real-time (NRT) search feature, available since 2.9, enables an application to make index changes visible to a new searcher with fast turnaround time. In some cases, such as modern social/news sites (e.g., LinkedIn, Twitter, Facebook, Stack Overflow, Hacker News, DZone, etc.), fast turnaround time is a hard requirement.

Fortunately, it's trivial to use. Just open your initial NRT reader, like this:

// w is your IndexWriter
IndexReader r =, true);

(That's the 3.1+ API; prior to that use w.getReader() instead).

The returned reader behaves just like one opened with it exposes the point-in-time snapshot of the index as of when it was opened. Wrap it in an IndexSearcher and search away!

Once you've made changes to the index, call r.reopen() and you'll get another NRT reader; just be sure to close the old one.
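The swap itself is worth getting right: reopen returns the same instance when nothing changed, and closing the reader you are still searching is a classic bug, so only close the old reader when you actually got a new one. Here is that pattern as a self-contained sketch, with MockReader standing in for IndexReader's reopen()/close() semantics:

```java
// Stand-in for IndexReader: reopen() returns this same instance when the
// index is unchanged, or a fresh reader when it changed.
class MockReader {
    final boolean indexChanged;
    boolean closed;

    MockReader(boolean indexChanged) { this.indexChanged = indexChanged; }

    MockReader reopen() {
        return indexChanged ? new MockReader(false) : this;
    }

    void close() { closed = true; }
}

public class ReopenPattern {
    public static void main(String[] args) {
        MockReader r = new MockReader(true);  // pretend changes were indexed

        // Only close the old reader if reopen actually returned a new one:
        MockReader newReader = r.reopen();
        if (newReader != r) {
            r.close();
            r = newReader;
        }
        System.out.println("current reader closed: " + r.closed);
    }
}
```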

What's special about the NRT reader is that it searches uncommitted changes from IndexWriter, enabling your application to decouple fast turnaround time from index durability on crash (i.e., how often commit is called), something not previously possible.

Under the hood, when an NRT reader is opened, Lucene flushes indexed documents as a new segment, applies any buffered deletions to in-memory bit-sets, and then opens a new reader showing the changes. The reopen time is in proportion to how many changes you made since last reopening that reader.

Lucene's approach is a nice compromise between immediate consistency, where changes are visible after each index change, and eventual consistency, where changes are visible "later" but you don't usually know exactly when.

With NRT, your application has controlled consistency: you decide exactly when changes must become visible.

Recently there have been some good improvements related to NRT:
  • New default merge policy, TieredMergePolicy, which is able to select more efficient non-contiguous merges, and favors segments with more deletions.

  • NRTCachingDirectory takes load off the IO system by caching small segments in RAM (LUCENE-3092).

  • When you open an NRT reader you can now optionally specify that deletions do not need to be applied, making reopen faster for those cases that can tolerate temporarily seeing deleted documents returned, or have some other means of filtering them out (LUCENE-2900).

  • Segments that are 100% deleted are now dropped instead of inefficiently merged (LUCENE-2010).

How fast is NRT search?

I created a simple performance test to answer this. I first built a starting index by indexing all of Wikipedia's content (25 GB plain text), broken into 1 KB sized documents.

Using this index, the test then reindexes all the documents again, this time at a fixed rate of 1 MB/second plain text. This is a very fast rate compared to the typical NRT application; for example, it's almost twice as fast as Twitter's recent peak during this year's Super Bowl (4,064 tweets/second), assuming every tweet is 140 bytes, and assuming Twitter indexed all tweets on a single shard.

The test uses updateDocument, replacing documents by randomly selected ID, so that Lucene is forced to apply deletes across all segments. In addition, 8 search threads run a fixed TermQuery at the same time.

Finally, the NRT reader is reopened once per second.

I ran the test on modern hardware, a 24 core machine (dual x5680 Xeon CPUs) with an OCZ Vertex 3 240 GB SSD, using Oracle's 64 bit Java 1.6.0_21 and Linux Fedora 13. I gave Java a 2 GB max heap, and used MMapDirectory.

The test ran for 6 hours 25 minutes, since that's how long it takes to re-index all of Wikipedia at a limited rate of 1 MB/sec; here's the resulting QPS and NRT reopen delay (milliseconds) over that time:

The search QPS is green and the time to reopen each reader (NRT reopen delay in milliseconds) is blue; the graph is an interactive Dygraph, so if you click through above, you can then zoom in to any interesting region by clicking and dragging. You can also apply smoothing by entering the size of the window into the text box in the bottom left part of the graph.

Search QPS dropped substantially with time. While annoying, this is expected, because of how deletions work in Lucene: documents are merely marked as deleted and thus are still visited but then filtered out, during searching. They are only truly deleted when the segments are merged. TermQuery is a worst-case query; harder queries, such as BooleanQuery, should see less slowdown from deleted, but not reclaimed, documents.

Since the starting index had no deletions, and then picked up deletions over time, the QPS dropped. It looks like TieredMergePolicy should perhaps be even more aggressive in targeting segments with deletions; however, finally around 5:40 a very large merge (reclaiming many deletions) was kicked off. Once it finished the QPS recovered somewhat.

Note that a real NRT application with deletions would see a more stable QPS since the index in "steady state" would always have some number of deletions in it; starting from a fresh index with no deletions is not typical.

Reopen delay during merging

The reopen delay is mostly around 55-60 milliseconds (mean is 57.0), which is very fast (i.e., only a 5.7% "duty cycle" of the once-per-1.0-second reopen rate). There are occasional single spikes, which are caused by Java running a full GC cycle. However, large merges can slow down the reopen delay (once around 1:14, again at 3:34, and then the very large merge starting at 5:40). Many small merges (up to a few hundred MB) were done but don't seem to impact reopen delay. Large merges have been a challenge in Lucene for some time, also causing trouble for ongoing searching.

I'm not yet sure why large merges so adversely impact reopen time; there are several possibilities. It could be simple IO contention: a merge keeps the IO system very busy reading and writing many bytes, thus interfering with any IO required during reopen. However, if that were the case, NRTCachingDirectory (used by the test) should have prevented it, but didn't. It's also possible that the OS is [poorly] choosing to evict important process pages, such as the terms index, in favor of IO caching, causing the term lookups required when applying deletes to hit page faults; however, this also shouldn't be happening in my test since I've set Linux's swappiness to 0.

Yet another possibility is Linux's write cache becomes temporarily too full, thus stalling all IO in the process until it clears; in this case perhaps tuning some of Linux's pdflush tunables could help, although I'd much rather find a Lucene-only solution so this problem can be fixed without users having to tweak such advanced OS tunables, even swappiness.

Fortunately, we have an active Google Summer of Code student, Varun Thacker, working on enabling Directory implementations to pass appropriate flags to the OS when opening files for merging (LUCENE-2793 and LUCENE-2795). From past testing I know that passing O_DIRECT can prevent merges from evicting hot pages, so it's possible this will fix our slow reopen time as well since it bypasses the write cache.

Finally, it's always possible other OSs do a better job managing the buffer cache, and wouldn't see such reopen delays during large merges.

This issue is still a mystery, as there are many possibilities, but we'll eventually get to the bottom of it. It could be we should simply add our own IO throttling, so we can control net MB/sec read and written by merging activity. This would make a nice addition to Lucene!

Except for the slowdown during merging, the performance of NRT is impressive. Most applications will have a required indexing rate far below 1 MB/sec per shard, and for most applications reopening once per second is fast enough.

While there are exciting ideas to bring true real-time search to Lucene, by directly searching IndexWriter's RAM buffer as Michael Busch has implemented at Twitter with some cool custom extensions to Lucene, I doubt even the most demanding social apps actually truly need better performance than we see today with NRT.

NIOFSDirectory vs MMapDirectory

Out of curiosity, I ran the exact same test as above, but this time with NIOFSDirectory instead of MMapDirectory:

There are some interesting differences. The search QPS is substantially slower -- starting at 107 QPS vs 151, though part of this could easily be from getting different compilation out of hotspot. For some reason TermQuery, in particular, has high variance from one JVM instance to another.

The mean reopen time is slower: 67.7 milliseconds vs 57.0, and the reopen time seems more affected by the number of segments in the index (this is the saw-tooth pattern in the graph, matching when minor merges occur). The takeaway message seems clear: on Linux, use MMapDirectory not NIOFSDirectory!

Optimizing your NRT turnaround time

My test was just one datapoint, at a fixed fast reopen period (once per second) and at a high indexing rate (1 MB/sec plain text). You should test specifically for your use-case what reopen rate works best. Generally, the more frequently you reopen the faster the turnaround time will be, since fewer changes need to be applied; however, frequent reopening will reduce the maximum indexing rate.

Most apps have relatively low required indexing rates compared to what Lucene can handle and can thus pick a reopen rate to suit the application's turnaround time requirements.

There are also some simple steps you can take to reduce the turnaround time:
  • Store the index on a fast IO system, ideally a modern SSD.

  • Install a merged segment warmer (see IndexWriter.setMergedSegmentWarmer). This warmer is invoked by IndexWriter to warm up a newly merged segment without blocking the reopen of a new NRT reader. If your application uses Lucene's FieldCache or has its own caches, this is important as otherwise that warming cost will be spent on the first query to hit the new reader.

  • Use only as many indexing threads as needed to achieve your required indexing rate; often 1 thread suffices. The fewer threads used for indexing, the faster the flushing, and the less merging (on trunk).

  • If you are using Lucene's trunk, and your changes include deleting or updating prior documents, then use the Pulsing codec for your id field since this gives faster lookup performance which will make your reopen faster.

  • Use the new NRTCachingDirectory, which buffers small segments in RAM to take load off the IO system (LUCENE-3092).

  • Pass false for applyDeletes when opening an NRT reader, if your application can tolerate seeing deleted docs from the returned reader.

  • While it's not clear that thread priorities actually work correctly (see this Google Tech Talk), you should still set your thread priorities properly: the thread reopening your readers should be highest; next should be your indexing threads; and finally lowest should be all searching threads. If the machine becomes saturated, ideally only the search threads should take the hit.

Happy near-real-time searching!

Saturday, May 21, 2011

The invisible Lucene bug fixed point

It turns out, the Jira issue tracking system, which we make heavy use of here at Apache, uses Lucene under the hood for searching and browsing issues. This is wonderful since it means Lucene developers are eating their own dog food whenever they use Jira.

Atlassian has opened up some doozy bugs over time, including one of the earliest bug numbers I've ever worked on, LUCENE-140. They sent me a t-shirt for fixing that one (thank you!).

Now, imagine this: what if there were a sneaky bug in Lucene, say a certain text fragment that causes an exception during indexing. A user opens an issue to report this, including the problematic text fragment, yet, because Jira uses Lucene, it hits an exception while indexing that fragment and causes this one bug to be un-searchable and un-viewable when browsing! An invisible bug fixed point.

It's somewhat mind bending to think about, Lucene recursing on itself through Jira, yet it's theoretically possible! Maybe we have a few invisible bug fixed points lurking already and nobody knows...

Saturday, May 7, 2011

265% indexing speedup with Lucene's concurrent flushing

A week ago, I described the nightly benchmarks we use to catch any unexpected slowdowns in Lucene's performance. Back then the graphs were rather boring (a good thing), but, not anymore! Have a look at the stunning jumps in Lucene's indexing rate:

(Click through the image to see details about what changed on dates A, B, C and D).

Previously we were around 102 GB of plain text per hour, and now it's about 270 GB/hour. That's a 265% jump! Lucene now indexes all of Wikipedia's 23.2 GB (English) export in 5 minutes and 10 seconds.

How did this happen? Concurrent flushing.

That new feature, having lived on a branch for quite some time, undergoing many fun iterations, was finally merged back to trunk about a week ago.

Before concurrent flushing, whenever IndexWriter needed to flush a new segment, it would stop all indexing threads and hijack one thread to perform the rather compute intensive flush. This was a nasty bottleneck on computers with highly concurrent hardware; flushing was inherently single threaded. I previously described the problem here.

But with concurrent flushing, each thread freely flushes its own segment even while other threads continue indexing. No more bottleneck!

Note that there are two separate jumps in the graph. The first jump, the day concurrent flushing landed (labelled as B on the graph), shows the improvement while using only 6 threads and 512 MB RAM buffer during indexing. Those settings resulted in the fastest indexing rate before concurrent flushing.

The second jump (labelled as D on the graph) happened when I increased the indexing threads to 20 and dropped the RAM buffer to 350 MB, giving the fastest indexing rate after concurrent flushing.

One nice side effect of concurrent flushing is that you can now use RAM buffers well over 2.1 GB, as long as you use multiple threads. Curiously, I found that larger RAM buffers slow down overall indexing rate. This might be because of the discontinuity when closing IndexWriter, when we must wait for all the RAM buffers to be written to disk. It would be better to measure steady state indexing rate, while indexing an effectively infinite content source, and ignoring the startup and ending transients; I suspect if I measured that instead, we'd see gains from larger RAM buffers, but this is just speculation at this point.

There were some very challenging changes required to make concurrent flushing work, especially around how IndexWriter handles buffered deletes. Simon Willnauer does a great job describing these changes here and here. Concurrency is tricky!

Remember this change only helps you if you have concurrent hardware, you use enough threads for indexing and there's no other bottleneck (for example, in the content source that provides the documents). Also, if your IO system can't keep up then it will bottleneck your CPU concurrency. The nightly benchmark runs on a computer with 12 real (24 with hyperthreading) cores and a fast (OCZ Vertex 3) solid-state disk. Finally, this feature is not yet released: it was committed to Lucene's trunk, which will eventually be released as 4.0.

Friday, April 29, 2011

Catching slowdowns in Lucene

Lucene has great randomized tests to catch functional failures, but when we accidentally commit a performance regression (we slow down indexing or searching), nothing catches us!

This is scary, because we want things to get only faster with time.

So, when there's a core change that we think may impact performance, we run before/after tests to verify. But this is ad-hoc and error-prone: we could easily forget to do this, or fail to anticipate that a code change might have a performance impact.

Even when we do test performance of a change, the slowdown could be relatively small, easily hiding within the unfortunately often substantial noise of our tests. Over time we might accumulate many such small, unmeasurable slowdowns, suffering the fate of the boiling frog. We do also run performance tests before releasing, but it's better to catch them sooner: solving slowdowns just before releasing is.... dangerous.

To address this problem, I've created a script that runs standard benchmarks on Lucene's trunk (to be 4.0), nightly. It indexes all of Wikipedia's English XML export, three times (with different settings and document sizes), runs a near-real-time (NRT) turnaround time test for 30 minutes, and finally a diverse set of hard queries.

This has been running for a few weeks now, and the results are accessible to anyone.

It's wonderful to see that Lucene's indexing throughput is already a bit faster (~98 GB plain text per hour) than when I last measured!

Near-real-time reopen latency is here; the test measures how long it takes (on average, after discarding outliers) to open a new NRT reader. It's quite intensive, indexing around 1 MB plain text per second as updates (delete+addDocument), and reopening once per second, on the full previously built Wikipedia index.

To put this in perspective, that's almost twice Twitter's recent peak indexing rate during the 2011 Superbowl (4,064 Tweets/second), although Twitter's use-case is harder because the documents are much smaller, and presumably there's additional indexed metadata beyond just the text of the Tweet. Twitter has actually implemented some cool changes to Lucene to enable real-time searching without reopening readers; Michael Busch describes them here and here. Some day I hope these will be folded into Lucene!

Finally, we test all sorts of queries: PhraseQuery (exact and sloppy), FuzzyQuery (edit distance 1 and 2), four variants of BooleanQuery, NumericRangeQuery, PrefixQuery, WildcardQuery, SpanNearQuery, and of course TermQuery. In addition we test the automaton spell checker, and primary-key lookup.

A few days ago, I switched all tests to the very fast 240 GB OCZ Vertex 3 (previously it was a traditional spinning-magnets hard drive). It looks like indexing throughput gained a bit of performance (~102 GB plain text per hour); the search performance was unaffected (expected, because for this test all postings easily fit in available RAM), but the NRT turnaround time saw a drastic reduction in the noise, to near-zero. NRT is very IO intensive, so it makes sense that a fast IO system improves its turnaround time; I need to dig further into this.

Unfortunately, performance results are inherently noisy. For example you can see the large noise (the error band is +/- one standard deviation) in the TermQuery results; other queries seem to have less noise for some reason.

So far the graphs are rather boring: nice and flat. This is a good thing!

Sunday, April 24, 2011

Just say no to swapping!

Imagine you love to cook; it's an intense hobby of yours. Over time, you've accumulated many fun spices, but your pantry is too small, so, you rent an off-site storage facility, and move the less frequently used spice racks there. Problem solved!

Suddenly you decide to cook this great new recipe. You head to the pantry to retrieve your Saffron, but it's not there! It was moved out to the storage facility and must now be retrieved (this is a hard page fault).

No problem -- your neighbor volunteers to go fetch it for you. Unfortunately, the facility is ~2,900 miles away, all the way across the US, so it takes your friend 6 days to retrieve it!

This assumes you normally take 7 seconds to retrieve a spice from the pantry; that your data was in main memory (~100 nanoseconds access time), not in the CPU's caches (which'd be maybe 10 nanoseconds); that your swap file is on a fast (say, WD Raptor) spinning-magnets hard drive with 5 millisecond average access time; and that your neighbor drives non-stop at 60 mph to the facility and back.

Even worse, your neighbor drives a motorcycle, and so he can only retrieve one spice rack at a time. So, after waiting 6 days for the Saffron to come back, when you next go to the pantry to get some Paprika, it's also "swapped out" and you must wait another 6 days! It's possible that first spice rack also happened to have the Paprika but it's also likely it did not; that depends on your spice locality. Also, with each trip, your neighbor must pick a spice rack to move out to the facility, so that the returned spice rack has a place to go (it is a "swap", after all), so the Paprika could have just been swapped out!

Sadly, it might easily be many weeks until you succeed in cooking your dish.

Maybe in the olden days, when memory itself was a core of little magnets, swapping cost wasn't so extreme, but these days, as memory access time has improved drastically while hard drive access time hasn't budged, the disparity is now unacceptable. Swapping has become a badly leaking abstraction. When a typical process (say, your e-mail reader) has to "swap back in" after not being used for a while, it can hit 100s of such page faults, before finishing redrawing its window. It's an awful experience, though it has the fun side effect of letting you see, in slow motion, just what precise steps your email reader goes through when redrawing its window.

Swapping is especially disastrous with JVM processes. See, the JVM generally won't do a full GC cycle until it has run out of its allowed heap, so most of your heap is likely occupied by not-yet-collected garbage. Since these pages aren't being touched (because they are garbage and thus unreferenced), the OS happily swaps them out. When GC finally runs, you have a ridiculous swap storm, pulling in all these pages only to then discover that they are in fact filled with garbage and should be discarded; this can easily make your GC cycle take many minutes!

It'd be better if the JVM could work more closely with the OS so that GC would somehow run on-demand whenever the OS wants to start swapping so that, at least, we never swap out garbage. Until then, make sure you don't set your JVM's heap size too large!

Just use an SSD...

These days, many machines ship with solid state disks, which are an astounding (though still costly) improvement over spinning magnets; once you've used an SSD you can never go back; it's just one of life's many one-way doors.

You might be tempted to declare that this problem is solved, since SSDs are so blazingly fast, right? Indeed, they are orders of magnitude faster than spinning magnets, but they are still 2-3 orders of magnitude slower than main memory or CPU cache. The typical SSD might have 50 microsecond access time, which equates to ~58 total miles of driving at 60 mph. Certainly a huge improvement, but still unacceptable if you want to cook your dish on time!
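The ~58-mile figure falls out of the same scaling as before: map a ~100 ns memory access to a 7-second pantry trip, apply that factor to the SSD's latency, and convert the result to driving at 60 mph (one mile per minute of driving):

```java
public class SsdAnalogy {
    public static void main(String[] args) {
        double pantrySeconds = 7.0;    // analogy: one trip to the pantry
        double ramSeconds = 100e-9;    // ~100 ns main-memory access
        double ssdSeconds = 50e-6;     // ~50 microsecond SSD access

        // Scale SSD latency by the factor that maps 100 ns to 7 seconds:
        double analogySeconds = ssdSeconds * (pantrySeconds / ramSeconds);

        // At 60 mph, each minute of driving covers one mile:
        System.out.printf("~%.0f minutes of driving, i.e. ~%.0f miles%n",
                          analogySeconds / 60.0, analogySeconds / 60.0);
    }
}
```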

Just add RAM...

Another common workaround is to put lots of RAM in your machine, but this can easily back-fire: operating systems will happily swap out memory pages in favor of caching IO pages, so if you have any processes accessing lots of bytes (say, mencoder encoding a 50 GB bluray movie, maybe a virus checker or backup program, or even Lucene searching against a large index or doing a large merge), the OS will swap your pages out. This then means that the more RAM you have, the more swapping you get, and the problem only gets worse!

Fortunately, some OS's let you control this behavior: on Linux, you can tune swappiness down to 0 (most Linux distros default this to a highish number); Windows also has a checkbox, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache, that's likely doing something similar.
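On Linux, for example, the knob is the vm.swappiness sysctl (a sketch; 0 tells the kernel to strongly prefer dropping IO cache over swapping out application pages, and many distros default to 60):

```shell
# Check the current value:
sysctl vm.swappiness

# Lower it until the next reboot:
sudo sysctl -w vm.swappiness=0

# Or make it permanent by adding this line to /etc/sysctl.conf:
vm.swappiness=0
```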

There are low-level IO flags that these programs are supposed to use so that the OS knows not to cache the pages they access, but sometimes the processes fail to use them or cannot use them (for example, they are not yet exposed to Java), and even if they do, sometimes the OS ignores them!

When swapping is OK

If your computer never runs any interactive processes, i.e., a process where a human is blocked (waiting) on the other end for something to happen, and only runs batch processes which tend to be active at different times, then swapping can be an overall win, since it allows the active process to make nearly-full use of the available RAM. Net/net, over time, this will give greater overall throughput for the batch processes on the machine.

But, remember that the server running your web-site is an interactive process; if your server processes (web/app server, database, search server, etc.) are stuck swapping, your site has for all intents and purposes become unusable to your users.

This is a fixable problem

Most processes have known data structures that consume substantial RAM, and in many cases these processes could easily discard and later regenerate those data structures in far less time than it takes to page them back in from disk. Caches can simply be pruned or discarded since they will self-regenerate over time.

These data structures should never be swapped out, since regeneration is far cheaper. Somehow the OS should ask each RAM-intensive and least-recently-accessed process to discard its data structures to free up RAM, instead of swapping out the pages occupied by the data structure. Of course, this would require a tighter interaction between the OS and processes than exists today; Java's SoftReference is close, except this only works within a single JVM, and does not interact with the OS.
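Java's SoftReference makes the discard-and-regenerate idea concrete within a single JVM. Here is a minimal, hypothetical sketch (the class and its names are mine for illustration, not Lucene's) of a cache whose entries the garbage collector may reclaim under memory pressure, rather than letting the OS swap them out:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// A tiny cache: the GC is free to clear entries under memory pressure,
// and the caller simply regenerates the value on a miss.
class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    // Returns null if the entry was never added or was reclaimed by GC;
    // the caller then regenerates the value.
    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }

    public static void main(String[] args) {
        SoftCache<String, int[]> cache = new SoftCache<>();
        cache.put("terms-index", new int[1024]);
        int[] data = cache.get("terms-index");
        System.out.println(data == null ? "miss: regenerate" : "cache hit");
    }
}
```

The key property: a miss is cheap to handle (regenerate the value), while a swapped-out page silently costs a disk seek on every access. The limitation, as noted above, is that this only works within one JVM and the OS knows nothing about it.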

What can you do?

Until this problem is solved for real, the simplest workaround is to disable swapping entirely, and stuff as much RAM as you can into the machine. RAM is cheap, memory modules are dense, and modern motherboards accept many modules. This is what I do.
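On Linux, disabling swap entirely might look like this (a sketch; the device name is a placeholder for whatever your swap partition is):

```
# turn off all swap for the current boot (requires root):
#   sudo swapoff -a

# /etc/fstab -- comment out the swap line so it stays off after reboot:
# /dev/sda2  none  swap  sw  0  0
```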

Of course, with this approach, when you run out of RAM stuff will start failing. If the software is well written, it'll fail gracefully: your browser will tell you it cannot open a new window or visit a new page. If it's poorly written it will simply crash, thus quickly freeing up RAM and hopefully not losing any data or corrupting any files in the process. Linux takes the simple, draconian approach of picking a memory-hogging process and SIGKILL'ing it (the OOM killer).

If you don't want to disable swapping you should at least tell the OS not to swap pages out for IO caching.

Just say no to swapping!

Thursday, March 31, 2011

A login-wall is nearly as bad as a pay-wall!

Much has been said and asked about the differences between Stack Overflow and Quora.

And, while there are deep and interesting differences, such as how Stack Overflow makes reputation tracking and badges explicit, in my opinion, one simple difference is the most important of all: Quora's login-wall.

See, you cannot do anything with Quora until you've registered, while with Stack Overflow you can do almost everything without registering. They are polar opposites!

Like everyone else, I have too much curiosity and too little time. I try to keep up on Hacker News (sorry Digg and Reddit): I click through to the cool stuff, and then move on. You have one precious first page impression to rope me in, so don't spend that impression with a login-wall!

I mean, sure, I'm still going to go link up my Facebook account so I can login to Quora and see the questions, answers, conversations. (And, yes, Facebook seems to be winning at the "universal ID" game, even though I like OpenID better.) Still, for each persistent user like me, you've lost 9 non-persistent ones with that dreaded login-wall.

Remember: if you are a new cool Web site, gaining value from the network effect (as all social sites do), trying to eke out just a tiny slice of all these fickle users jumping around out here, don't put up a login-wall! It's just about as bad as a pay-wall. Let brand new users do as much as possible with your site, and make that very first page impression count.

Saturday, March 26, 2011

Your test cases should sometimes fail!

I'm an avid subscriber of the delightful weekly (sometimes) Python-URL! email, highlighting the past week's interesting discussions across the numerous Python lists. Each summary starts with the best quote from the week; here's last week's quote:
"So far as I know, that actually just means that the test suite is insufficient." - Peter Seebach, when an application passes all its tests.
I wholeheartedly agree: if your build always passes its tests, that means your tests are not tough enough! Ideally the tests should stay ahead of the software, constantly pulling you forwards to improve its quality. If the tests keep passing, write new ones that fail! Or make existing ones evil-er.

You'll be glad to know that Lucene/Solr's tests do sometimes fail, as you can see in the Jenkins (formerly Hudson) automated trunk builds.

Randomized testing

Our test infrastructure has gotten much better, just over the past 6 months or so, through heavy use of randomization.

When a test needs a Directory instance, but doesn't care which, it uses the newDirectory method. This method picks one of Lucene's Directory implementations (RAMDirectory, NIOFSDirectory, MMapDirectory, etc.) and then wraps it with MockDirectoryWrapper, a nice little class that does all sorts of fun things like: occasionally calling Thread.yield; preventing still-open files from being overwritten or deleted (acts-like-Windows); refusing to write to the same file twice (verifying Lucene is in fact write-once); breaking up a single writeBytes into multiple calls; optionally throwing IOException on disk full, or simply throwing exceptions at random times; simulating an OS/hardware crash by randomly corrupting un-sync'd files in devilish ways; etc. We also randomly pick a timezone and locale for each test.
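The fault-injection idea behind MockDirectoryWrapper can be sketched generically. This is a hypothetical, much-simplified interface of my own (not Lucene's actual API), showing two of the tricks: randomly throwing IOException, and splitting one writeBytes into multiple calls to catch code that wrongly assumes a single write:

```java
import java.io.IOException;
import java.util.Random;

// Simplified stand-in for a directory's output stream.
interface Output {
    void writeBytes(byte[] b, int off, int len) throws IOException;
}

// Wraps a real Output and randomly misbehaves, so tests exercise
// error-handling paths that rarely fire in production.
class FlakyOutput implements Output {
    private final Output delegate;
    private final Random random;

    FlakyOutput(Output delegate, Random random) {
        this.delegate = delegate;
        this.random = random;
    }

    @Override
    public void writeBytes(byte[] b, int off, int len) throws IOException {
        if (random.nextInt(100) == 0) {
            // rare, seed-reproducible failure
            throw new IOException("simulated disk full");
        }
        // Break the single write into two calls to the delegate:
        int mid = off + len / 2;
        delegate.writeBytes(b, off, mid - off);
        delegate.writeBytes(b, mid, off + len - mid);
    }
}
```

Because the misbehavior is driven by a Random with a known seed, any failure it provokes is reproducible, which is exactly what makes this style of testing practical.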

To randomize indexing, we create an IndexWriterConfig, tweaking all sorts of settings, and use RandomIndexWriter (like IndexWriter, except it sometimes optimizes, commits, yields, etc.). The newField method randomly enables or disables stored fields and term vectors. We create random codecs, per field, by combining a terms dictionary with a random terms index and postings implementation. MockAnalyzer injects payloads into its tokens.

Sometimes we use the PreFlex codec, which writes all indices in the 3.x format (so that we test index backwards compatibility), and sometimes the nifty SimpleText codec. We have exotic methods for creating random yet somewhat realistic full Unicode strings. When creating an IndexSearcher, we might use threads (pass an ExecutorService), or not. We catch tests that leave threads running, or that cause insanity in the FieldCache (for example by loading both parent and sub readers).


To ensure a failure is reproducible, we save the random seeds and on a failure print out a nice line like this:
NOTE: reproduce with: ant test -Dtestcase=TestFieldCacheTermsFilter -Dtestmethod=testMissingTerms -Dtests.seed=-1046382732738729184:5855929314778232889
This fixes the seed so that the test runs deterministically. Sometimes, horribly, we have bugs in this seed logic, causing tests not to run deterministically, and we scramble to fix those bugs first!
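The seed mechanics can be sketched in plain Java (a hypothetical toy of my own, not Lucene's actual test runner): draw a random seed, report it if the run fails, and re-running with that seed repeats the run exactly:

```java
import java.util.Random;

class SeededTest {
    // All randomness flows from the single seed, so the whole run is
    // a deterministic function of that seed.
    static boolean runOnce(long seed) {
        Random random = new Random(seed);
        int docCount = 1 + random.nextInt(100); // randomized test parameter
        // ... exercise the code under test with docCount documents ...
        return docCount != 13; // pretend docCount == 13 tickles a bug
    }

    public static void main(String[] args) {
        long seed = new Random().nextLong(); // fresh seed each run
        if (!runOnce(seed)) {
            // This printed seed is the "golden seed" to share with the dev list:
            System.out.println("NOTE: reproduce with: -Dtests.seed=" + seed);
        }
        // With a fixed seed, the outcome is always the same:
        assert runOnce(42L) == runOnce(42L);
    }
}
```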

If you happen to hit a test failure, please send that precious line to the dev list! This is like the Search for Extraterrestrial Intelligence (SETI): there are some number of random seeds out there (hopefully, not too many!), that will lead to a failure, and if your computer is lucky enough to discover one of these golden seeds, please share the discovery!

The merging of Lucene and Solr's development was also a big step forward for test coverage, since every change in Lucene is now tested against all of Solr's test cases as well.

Tests accept a multiplier to crank things up, causing them to use more test documents or iterations, run for longer time, etc. We now have perpetual jobs on Jenkins, for both 3.x and trunk, launching every 15 minutes with multiplier 5. We know quickly when someone breaks the build!

This added test coverage has already caught a number of sneaky bugs (including a rare index corruption case on disk-full and a chunking bug in MMapDirectory) that we otherwise would not have discovered for some time.

The test infrastructure itself is so useful that it's now been factored out as a standalone JAR so apps using Lucene can tap into it to create their own fun randomized tests.