Friday, September 17, 2010

Lucene's indexing is fast!

Wikipedia periodically exports all of the content on their site, providing a nice corpus for performance testing. I downloaded their most recent English XML export: it uncompresses to a healthy 21 GB of plain text! Then I fully indexed this with Lucene's current trunk (to be 4.0): it took 13 minutes and 9 seconds, or 95.8 GB/hour -- not bad!

Here are the details: I first pre-process the XML file into a single-line file, whereby each doc's title, date, and body are written to a single line, and then index from this file, so that I measure "pure" indexing cost. Note that a real app would likely have a higher document creation cost here, perhaps having to pull documents from a remote database or from separate files, run filters to extract text from PDFs or MS Office docs, etc. I use Lucene's contrib/benchmark package to do the indexing; here's the alg I used:


content.source = org.apache.lucene.benchmark.byTask.feeds.LineDocSource
docs.file = /lucene/enwiki-20100904-pages-articles.txt

doc.stored = true
doc.term.vector = false
doc.tokenized = false
doc.body.stored = false
doc.body.tokenized = true


ram.flush.mb = 256


content.source.forever = false


{ "BuildIndex"
[ { "AddDocs" AddDoc > : * ] : 6
- CloseIndex

RepSumByPrefRound BuildIndex

There is no field truncation taking place, since this is now disabled by default -- every token in every Wikipedia article is being indexed. I tokenize the body field, and don't store it, and don't tokenize the title and date fields, but do store them. I use StandardAnalyzer, and I include the time to close the index, which means IndexWriter waits for any running background merges to complete. The index only has 4 fields -- title, date, body, and docid.

I've done a few things to speed up the indexing:
  • Increase IndexWriter's RAM buffer from the default 16 MB to 256 MB
  • Run with 6 threads
  • Disable compound file format
  • Reuse document/field instances (contrib/benchmark does this by default)
Lucene's wiki describes additional steps you can take to speed up indexing.

Both the source lines file and index are on an Intel X25-M SSD, and I'm running it on a modern machine, with dual Xeon X5680s, overclocked to 4.0 Ghz, with 12 GB RAM, running Fedora Linux. Java is 64bit 1.6.0_21-b06, and I run as java -server -Xmx2g -Xms2g. I could certainly give it more RAM, but it's not really needed. The resulting index is 6.9 GB.

Out of curiosity, I made a small change to contrib/benchmark, to print the ingest rate over time. It looks like this (over a 100-second window):

Note that a large part (slightly over half!) of the time, the ingest rate is 0; this is not good! This happens because the flushing process, which writes a new segment when the RAM buffer is full, is single-threaded, and, blocks all indexing while it's running. This is a known issue, and is actively being addressed under LUCENE-2324.

Flushing is CPU intensive -- the decode and reencode of the great many vInts is costly. Computers usually have big write caches these days, so the IO shouldn't be a bottleneck. With LUCENE-2324, each indexing thread state will flush its own segment, privately, which will allow us to make full use of CPU concurrency, IO concurrency as well as concurrency across CPUs and the IO system. Once this is fixed, Lucene should be able to make full use of the hardware, ie fully saturate either concurrent CPU or concurrent IO such that whichever is the bottleneck in your context gates your ingest rate. Then maybe we can double this already fast ingest rate!


  1. In addition to the flush issue, I wonder if we shouldn't use a less expensive encoding for segments smaller than X docs (or N bytes), maybe even write much of the data uncompressed?

  2. We should try that!

    With flexible indexing it should be simple to make a codec that switches the underlying encoding (similar to how Pulsing codec works), eg based on doc count in the segment.

    I also want to make codec that decides term by term which encoding to use -- pulsing for very low freq terms, maybe standard for medium freq, and FOR/PFOR for high freq.

  3. If you interested how LUCENE-2324 turned out in its first benchmark I blogged about it today here: