Tuesday, October 5, 2010

Lucene's SimpleText codec

Inspired by this question on the Lucene user's list, I created a new codec in Lucene called the SimpleText codec. The best ideas come from the user's lists!

This is of course only available in Lucene's current trunk, which will eventually ship as the next major release (4.0). Flexible indexing makes it easy to swap in different codecs to do the actual writing and reading of postings data to/from the index, and we have several fun codecs already available and more on the way...

Unlike all other codecs, which save the postings data in compact binary files, this codec writes all postings to a single human-readable text file, like this:

field contents
  term file
    doc 0
      pos 5
  term is
    doc 0
      pos 1
  term second
    doc 0
      pos 3
  term test
    doc 0
      pos 4
  term the
    doc 0
      pos 2
  term this
    doc 0
      pos 0
END
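
Because the file is plain text, it is easy to read back with a few lines of code. Here is a rough sketch in plain Java (no Lucene classes involved) that parses a dump like the one above into nested maps. The class name SimpleTextPostingsSketch is made up for illustration, and it assumes the file contains only the line types shown here (field, term, doc, pos, END); the real codec may write more than that.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Lucene: reads the text dump shown above.
// Assumes only "field", "term", "doc", "pos" and "END" lines appear.
final class SimpleTextPostingsSketch {

    /** Returns field -> term -> docID -> positions, preserving file order. */
    static Map<String, Map<String, Map<Integer, List<Integer>>>> parse(String dump) {
        Map<String, Map<String, Map<Integer, List<Integer>>>> fields = new LinkedHashMap<>();
        Map<String, Map<Integer, List<Integer>>> terms = null; // terms of the current field
        Map<Integer, List<Integer>> docs = null;               // docs of the current term
        List<Integer> positions = null;                        // positions in the current doc

        for (String raw : dump.split("\n")) {
            // Trimming ignores indentation; a term with leading or trailing
            // spaces would not survive this, which is fine for a sketch.
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.equals("END")) {
                break;
            }
            if (line.startsWith("field ")) {
                terms = new LinkedHashMap<>();
                fields.put(line.substring("field ".length()), terms);
                docs = null;
                positions = null;
            } else if (line.startsWith("term ")) {
                if (terms == null) {
                    throw new IllegalArgumentException("term before field: " + line);
                }
                docs = new LinkedHashMap<>();
                terms.put(line.substring("term ".length()), docs);
                positions = null;
            } else if (line.startsWith("doc ")) {
                if (docs == null) {
                    throw new IllegalArgumentException("doc before term: " + line);
                }
                positions = new ArrayList<>();
                docs.put(Integer.parseInt(line.substring("doc ".length())), positions);
            } else if (line.startsWith("pos ")) {
                if (positions == null) {
                    throw new IllegalArgumentException("pos before doc: " + line);
                }
                positions.add(Integer.parseInt(line.substring("pos ".length())));
            } else {
                throw new IllegalArgumentException("unrecognized line: " + line);
            }
        }
        return fields;
    }
}
```

Calling SimpleTextPostingsSketch.parse on the dump above should give a single field, contents, holding six terms, each with doc 0 and one position.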


The codec is read/write, and fully functional. All of Lucene's unit tests pass (slowly) with this codec (which, by the way, is an awesome way to test your own codecs).

Note that the performance of SimpleText is quite poor, as expected! For example, there is no terms index for fast seeking to a specific term, no skipping data for fast seeking within a posting list, some operations require linear scanning, etc. So don't use this one in production!
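
To make the cost of a missing terms index concrete, here is a small plain-Java sketch (again, not Lucene code; LinearSeekSketch and its methods are invented for illustration). The first method finds a term the only way a bare text file allows, by scanning lines from the top. The second builds a sorted in-memory map from term to line number, a loose stand-in for what a terms index buys you; Lucene's real terms index works differently. It assumes a single field, since the same term can appear under several fields.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration, NOT Lucene code. Assumes a single-field dump.
final class LinearSeekSketch {

    /** Linear scan: returns the line index of "term <term>", or -1 if absent. */
    static int seekByScan(List<String> lines, String term) {
        String target = "term " + term;
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).trim().equals(target)) {
                return i; // cost grows with the number of lines before the term
            }
        }
        return -1;
    }

    /** One pass over the file builds a sorted map: term -> line index. */
    static TreeMap<String, Integer> buildTermIndex(List<String> lines) {
        TreeMap<String, Integer> index = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.startsWith("term ")) {
                index.put(line.substring("term ".length()), i);
            }
        }
        return index;
    }
}
```

With the map in hand, a lookup is a logarithmic-time get instead of a walk over the whole file, which is roughly the kind of shortcut SimpleText does without.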

But it should be useful for transparency, debugging, learning and teaching, or for anyone who is simply curious about what exactly Lucene stores in its inverted index.

9 comments:

  1. Mike, I'm confused as to how to read using the codecs. I see that I can create an IndexWriter using an IndexWriterConfig that has a setCodec method, but I don't see any such method on the IndexReader class.

  2. Hi Aryeh,

    The necessary codec is written into the index every time a segment is flushed or merged, by IndexWriter.

    So at read time (when you open an IndexReader) you can't change the codec anymore; instead you just have to ensure all codecs used when writing that index are still on the CLASSPATH. IndexReader will look at each segment, determine which codec wrote it, find that codec on the CLASSPATH, and use it to open the segment.

  3. Do you have a link to your codec? (The link above is dead.)

  4. Hi John,

    The SimpleText codec ships with the "codecs" module (lucene-codecs).

    Also I fixed the link above...

  5. Mike, I've had a blast playing around with Codecs lately. Thanks for SimpleText, it's very educational. I posted a blog article that hopefully can help others understand the basics of how a search engine works, based on SimpleText's output. Hope you and your readers enjoy it.

  6. Hi Doug,

    That's a great blog post! Thank you for sharing!

  7. Hi Michael,
    I am new to this Codec concept. Can you post an example of how to convert index files to a readable format?

    thanks,
    yuva

    Replies
    1. Hi Yuva,

      Can you please ask on the Lucene user's list (java-user@lucene.apache.org)?
