Friday, April 29, 2011

Catching slowdowns in Lucene

Lucene has great randomized tests to catch functional failures, but when we accidentally commit a performance regression (we slow down indexing or searching), nothing catches us!

This is scary, because we want things to get only faster with time.

So, when there's a core change that we think may impact performance, we run before/after tests to verify. But this is ad-hoc and error-prone: we could easily forget to do this, or fail to anticipate that a code change might have a performance impact.

Even when we do test performance of a change, the slowdown could be relatively small, easily hiding within the unfortunately often substantial noise of our tests. Over time we might accumulate many such small, unmeasurable slowdowns, suffering the fate of the boiling frog. We do also run performance tests before releasing, but it's better to catch them sooner: solving slowdowns just before releasing is.... dangerous.

To address this problem, I've created a script that runs standard benchmarks on Lucene's trunk (to be 4.0), nightly. It indexes all of Wikipedia's English XML export, three times (with different settings and document sizes), runs a near-real-time (NRT) turnaround time test for 30 minutes, and finally a diverse set of hard queries.

This has been running for a few weeks now, and the results are accessible to anyone.

It's wonderful to see that Lucene's indexing throughput is already a bit faster (~98 GB plain text per hour) than when I last measured!

Near-real-time reopen latency is here; the test measures how long it takes (on average, after discarding outliers) to open a new NRT reader. It's quite intensive, indexing around 1 MB plain text per second as updates (delete+addDocument), and reopening once per second, on the full previously built Wikipedia index.
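
To make that concrete, here is a rough sketch of such an update-and-reopen loop using current Lucene APIs (the index path, field names and loop bounds are made up for illustration; this is not the actual benchmark code, and exact method signatures vary a bit across Lucene versions):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class NRTReopenSketch {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/path/to/index"));  // hypothetical path
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // NRT reader opened from the writer, so it sees not-yet-committed changes:
        DirectoryReader reader = DirectoryReader.open(writer);

        for (int i = 0; i < 60; i++) {
          // An update is an atomic delete + addDocument, keyed on the "id" term:
          Document doc = new Document();
          doc.add(new StringField("id", "doc" + i, Field.Store.NO));
          doc.add(new TextField("body", "updated wikipedia text ...", Field.Store.NO));
          writer.updateDocument(new Term("id", "doc" + i), doc);

          Thread.sleep(1000);  // reopen roughly once per second

          long t0 = System.nanoTime();
          DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer);
          if (newReader != null) {
            reader.close();
            reader = newReader;
          }
          System.out.printf("reopen took %.1f msec%n", (System.nanoTime() - t0) / 1e6);
        }

        reader.close();
        writer.close();
        dir.close();
      }
    }

The nightly test does the same kind of thing at much higher intensity, and the reported latency is essentially the time spent in that reopen call.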

To put this in perspective, that's almost twice Twitter's recent peak indexing rate during the 2011 Superbowl (4,064 Tweets/second), although Twitter's use-case is harder because the documents are much smaller, and presumably there's additional indexed metadata beyond just the text of the Tweet. Twitter has actually implemented some cool changes to Lucene to enable real-time searching without reopening readers; Michael Busch describes them here and here. Some day I hope these will be folded into Lucene!

Finally, we test all sorts of queries: PhraseQuery (exact and sloppy), FuzzyQuery (edit distance 1 and 2), four variants of BooleanQuery, NumericRangeQuery, PrefixQuery, WildcardQuery, SpanNearQuery, and of course TermQuery. In addition we test the automaton spell checker, and primary-key lookup.
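
For reference, here is roughly how a few of these query types are constructed programmatically with a current Lucene release (the field names and terms are invented for illustration and are not the actual benchmark queries):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.WildcardQuery;

    public class QuerySketch {
      public static void main(String[] args) {
        // Exact phrase vs. sloppy phrase (slop = how far apart the terms may be):
        PhraseQuery exactPhrase = new PhraseQuery("body", "united", "states");
        PhraseQuery sloppyPhrase = new PhraseQuery(4, "body", "united", "states");

        // Fuzzy, allowing 1 or 2 edits from the query term:
        FuzzyQuery fuzzy1 = new FuzzyQuery(new Term("body", "lucene"), 1);
        FuzzyQuery fuzzy2 = new FuzzyQuery(new Term("body", "lucene"), 2);

        // One boolean variant: a required term plus an optional term:
        BooleanQuery bool = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("body", "wikipedia")), BooleanClause.Occur.MUST)
            .add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.SHOULD)
            .build();

        // Multi-term queries, expanded against the terms dictionary:
        PrefixQuery prefix = new PrefixQuery(new Term("body", "unit"));
        WildcardQuery wildcard = new WildcardQuery(new Term("body", "un*ed"));

        // And the baseline single-term query:
        TermQuery term = new TermQuery(new Term("body", "state"));
      }
    }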

A few days ago, I switched all tests to the very fast 240 GB OCZ Vertex 3 (previously it was a traditional spinning-magnets hard drive). It looks like indexing throughput gained a bit (~102 GB plain text per hour), search performance was unaffected (expected, because for this test all postings easily fit in available RAM), but the noise in the NRT turnaround time dropped drastically, to near zero. NRT is very IO intensive, so it makes sense that a fast IO system improves its turnaround time; I need to dig further into this.

Unfortunately, performance results are inherently noisy. For example you can see the large noise (the error band is +/- one standard deviation) in the TermQuery results; other queries seem to have less noise for some reason.

So far the graphs are rather boring: nice and flat. This is a good thing!

26 comments:

  1. This is an excellent piece of testing to have in place on a continuous basis. Is it possible to build on this to also test the performance of Solr?

    Among other things, it would be interesting to see how much the overhead of running an HTTP servlet affects this, and it would also provide a way to determine which solrconfig.xml settings give optimum performance for different scenarios.

    --PWolanin

  2. It would be great to have similar tests for Solr so we could catch Solr-specific slowdowns.

    It's certainly doable (it's just software!), but I don't think I'm going to have time near term to build this out.

    I would really love to see us get there though...

    Patches welcome ;)

  3. May I ask what is the hardware specification of your testing server?

  4. Hi Anonymous,

    The server has 2 Xeon X5680s, for a total of 24 cores (6 cores/CPU X 2 CPUs X 2 for hyperthreading), overclocked to 4.0 GHz, with 12 GB of RAM. The OS is Fedora 13. The index is written to a 240 GB OCZ Vertex 3, and content is read from a separate spinning-magnets hard drive.

  5. Reading documents from Twitter, they claim that a fully loaded machine with 144M tweets can handle 5000 RPS with 150ms latency. Looking at your benchmark, the queries do around 30. I am sure that I am missing something (like I am comparing apples to oranges), so I will ask for some explanation, please.

  6. Hi Anonymous,

    I don't understand the numbers you're referring to; can you share the documents from twitter? Which benchmark of mine are you getting 30 ms latency from? And is 5000 RPS queries per second (search time) or tweets/second (indexing time)?

  7. Section VII Deployment and Performance

    http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf

  8. I was referring to roughly 30 QPS. I agree that their situation is only searching a few things (tweet text and usernames), but how do they manage to get 5000 QPS, or am I wrong?

    Replies
    1. Or is it just that the index is in RAM?

    2. OK I see, and thanks for sharing the link to the Earlybird paper.

      I'm not familiar with Earlybird's design, but scanning the paper it's clearly been heavily customized to match Twitter's specific needs (e.g. encoding a position in 8 bits since a tweet is at most 140 chars). It's also fully RAM resident, I think, and the search can terminate early since it's always sorted in reverse chronological order.

      Versus the 30 QPS number, which is a general-case search on larger docs, using a single thread.

    3. Thanks for clearing that up. What is a common sweet spot for search threads? If I am using 10 threads (assuming I have tuned Lucene right), can I expect 300 QPS? How many threads before performance starts to suffer, or is it app specific?

    4. Oh, and what would the relative performance be when serving from a RAM-resident index?

    5. The speedup from multiple threads depends entirely on how concurrent your hardware is; I'd suggest at most 2*number-of-CPU-cores search threads. If your hardware has 10-fold concurrency (CPU and IO) then yes, you should hit 300 QPS with 10 search threads.

      For a RAM resident index, it's best to use MMapDirectory and let the OS manage the RAM; if there is plenty of free RAM for it (ie, you keep your JVM heap sizes low) then it will hold the entire index (or at least the "hot" parts) in RAM. The speedup of a hot index over a cold index is enormous in many cases, because seeking is exceptionally costly for spinning-magnet disks and still costly even for SSDs.
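
      Roughly, sharing one IndexSearcher over an MMapDirectory-backed index across a small fixed thread pool looks like this (just a sketch; the index path and field name are made up):

        import java.nio.file.Paths;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.TermQuery;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.MMapDirectory;

        public class ConcurrentSearchSketch {
          public static void main(String[] args) throws Exception {
            // Let the OS page cache hold the (hot parts of the) index; keep the JVM heap small.
            Directory dir = new MMapDirectory(Paths.get("/path/to/index"));
            DirectoryReader reader = DirectoryReader.open(dir);
            IndexSearcher searcher = new IndexSearcher(reader);  // thread-safe; share one instance

            int threads = 2 * Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
              pool.submit(() -> {
                try {
                  searcher.search(new TermQuery(new Term("body", "state")), 10);
                } catch (Exception e) {
                  e.printStackTrace();
                }
              });
            }
            pool.shutdown();
            // ... await termination, then close reader and dir
          }
        }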

    6. Yes, I see. Thank you for the explanation.

    7. Can you explain how the dictionary is linked with this implementation of posting lists? In the traditional case we have a dictionary like hashmap[String, List(int,int)] // word -> docid, termfreq. In this case the dictionary points to "parallel arrays" slots, and the "pointer array" points to the most recent docid in the posting list, which means you have to search the posting list; in other words, how does this map to the List(int,int) part?

    8. Hi, maybe you can ask this on Lucene's dev list? (dev@lucene.apache.org).

    9. Replied; the name of the thread is "Posting list".

    10. We just have, for every term, a posting list of the documents where the term occurs.

  9. Awesome benchmark program; however, I've been unable to get the data files:
    --2013-08-23 15:52:53-- http://people.apache.org/~mikemccand/enwiki-20120502-lines-1k.txt.lzma
    Resolving people.apache.org (people.apache.org)... 140.211.11.9
    Connecting to people.apache.org (people.apache.org)|140.211.11.9|:80... failed:
    Connection timed out.
    Retrying.

    Correct address? Will try later... Thanks!

    Replies
    1. Hi Anonymous,

      That address is correct; try again later / from a different network?

  10. Hi Sir, I have a large database (1 TB) that I want to index using Lucene 4.0. Later I will need to re-index as values are updated; if I delete the whole index and rebuild from scratch it may take a huge amount of time.
    Is there any way to update the index incrementally?

    Replies
    1. Yes, just open an IndexWriter on the index and add the new documents to it, delete old documents, etc.
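
      Roughly (just a sketch, assuming each row carries a unique "id" field; the path and field names here are made up):

        import java.nio.file.Paths;

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.StringField;
        import org.apache.lucene.document.TextField;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.store.FSDirectory;

        public class IncrementalUpdateSketch {
          public static void main(String[] args) throws Exception {
            // Open the *existing* index; nothing is rebuilt from scratch.
            IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/path/to/existing/index")),
                new IndexWriterConfig(new StandardAnalyzer()));

            // Replace a changed row: updateDocument atomically deletes any doc
            // with the same "id" term and adds the new version.
            Document doc = new Document();
            doc.add(new StringField("id", "row-42", Field.Store.YES));
            doc.add(new TextField("body", "new or changed content", Field.Store.NO));
            writer.updateDocument(new Term("id", "row-42"), doc);

            // Remove rows deleted from the database:
            writer.deleteDocuments(new Term("id", "row-17"));

            writer.commit();
            writer.close();
          }
        }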

  11. Hi Mike, I'm trying to read the benchmark as a way to learn the relative cost of different queries. Are the different query results comparable to each other?

    They seem a bit counterintuitive to me: Wildcard query (15 QPS) is just 2x slower than Term query (30 QPS), FuzzyQuery (edit distance 2) is faster than both (40 QPS), and primary key lookup is in another sphere altogether (800 KQPS).
    Perhaps the QPS numbers are so close because I/O is the bottleneck, and with a memory-resident index they would be very different?

    Replies
    1. Hi Gili,

      No, each query type is wildly different from the other query types so you really cannot compare them. You can only compare a query type with itself from different days ...

      An in-memory index/codec format will affect different queries differently. E.g. the switch to MemoryPF for the "id" field was a big speedup...

  12. Do you have graphs of CPU utilization and disk IOPS during tests?

    Replies
    1. Alas, no, not yet. Patches welcome! Python has the helpful psutil module that should make this straightforward.
