Wednesday, March 14, 2012

New index statistics in Lucene 4.0

In the past, Lucene recorded only the bare minimum of aggregate index statistics necessary to support its hard-wired classic vector-space scoring model.

Fortunately, this situation is wildly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, Language models, Divergence from Randomness models and Information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.
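
For example, picking one of the new models is just a matter of setting it on your searcher. A tiny sketch (assuming you already have an IndexReader named reader):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.similarities.BM25Similarity;

    IndexSearcher searcher = new IndexSearcher(reader);
    // k1 and b are the usual BM25 free parameters; the no-arg
    // constructor picks common defaults
    searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));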

To understand the new statistics, let's pretend we've indexed the following two example documents, each with only one field "title":
  • document 1: The Lion, the Witch, and the Wardrobe
  • document 2: The Da Vinci Code
Assume we tokenize on whitespace, commas are removed, all terms are downcased and we don't discard stop-words.
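
For concreteness, here's a minimal sketch of indexing those two documents against the 4.0 APIs; StandardAnalyzer with an empty stop-word set only approximates the rules above (it downcases and strips the commas), so treat the analyzer choice as illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    Directory dir = new RAMDirectory();
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40,
        new StandardAnalyzer(Version.LUCENE_40, CharArraySet.EMPTY_SET));
    IndexWriter writer = new IndexWriter(dir, iwc);

    Document doc1 = new Document();
    doc1.add(new TextField("title", "The Lion, the Witch, and the Wardrobe", Field.Store.NO));
    writer.addDocument(doc1);

    Document doc2 = new Document();
    doc2.add(new TextField("title", "The Da Vinci Code", Field.Store.NO));
    writer.addDocument(doc2);

    writer.close();

Here are the statistics Lucene tracks: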
    TermsEnum.docFreq()
How many documents contain at least one occurrence of the term in the field; 3.x indices also save this (TermEnum.docFreq()). For term "lion" docFreq is 1, and for term "the" it's 2.

    Terms.getSumDocFreq()
Number of postings, i.e. sum of TermsEnum.docFreq() across all terms in the field. For our example documents this is 9.

    TermsEnum.totalTermFreq()
Number of occurrences of this term in the field, across all documents. For term "the" it's 4, for term "vinci" it's 1.

    Terms.getSumTotalTermFreq()
Number of term occurrences in the field, across all documents; this is the sum of TermsEnum.totalTermFreq() across all unique terms in the field. For our example documents this is 11.

    Terms.getDocCount()
How many documents have at least one term for this field. In our example documents, this is 2, but if for example one of the documents was missing the title field, it would be 1.

    Terms.getUniqueTermCount()
How many unique terms were seen in this field. For our example documents this is 8. Note that this statistic is of limited utility for scoring, because it's only available per-segment and you cannot (efficiently!) compute this across all segments in the index (unless there is only one segment).

    Fields.getUniqueTermCount()
Number of unique terms across all fields; this is the sum of Terms.getUniqueTermCount() across all fields. In our example documents this is 8. Note that this is also only available per-segment.

    Fields.getUniqueFieldCount()
Number of unique fields. For our example documents this is 1; if we also had a body field and an abstract field, it would be 3. Note that this is also only available per-segment.
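
Putting a few of these together, here's a sketch of dumping the statistics for our "title" field, segment by segment (assuming an IndexReader named reader, and the trunk APIs as of this writing):

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    for (AtomicReaderContext leaf : reader.leaves()) {
      Terms terms = leaf.reader().terms("title");
      if (terms == null) continue;   // this segment has no postings for the field
      System.out.println("sumDocFreq=" + terms.getSumDocFreq());
      System.out.println("sumTotalTermFreq=" + terms.getSumTotalTermFreq());
      System.out.println("docCount=" + terms.getDocCount());
      System.out.println("uniqueTermCount=" + terms.getUniqueTermCount());
      TermsEnum termsEnum = terms.iterator(null);
      BytesRef term;
      while ((term = termsEnum.next()) != null) {
        System.out.println(term.utf8ToString() + ": docFreq=" + termsEnum.docFreq()
            + " totalTermFreq=" + termsEnum.totalTermFreq());
      }
    }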

3.x indices only store TermsEnum.docFreq(), so if you want to experiment with the new scoring models in Lucene 4.0, you should either re-index or upgrade your index using IndexUpgrader. Note that the new scoring models all use the same single-byte norms format, so you can freely switch between them without re-indexing.
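
For example, a sketch of upgrading in place (IndexUpgrader can also be run from the command line via its main method; back up your index first):

    import java.io.File;
    import org.apache.lucene.index.IndexUpgrader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    Directory dir = FSDirectory.open(new File("/path/to/index"));
    // rewrites all older segments in the current index format
    new IndexUpgrader(dir, Version.LUCENE_40).upgrade();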

In addition to what's stored in the index, these statistics are also available per-field, per-document while indexing, in the FieldInvertState passed to the Similarity.computeNorm method in both 3.x and 4.0:
    length
How many tokens in the document. For document 1 it's 7; for document 2 it's 4.

    uniqueTermCount
For this field in this document, how many unique terms are there? For document 1, it's 5; for document 2 it's 4.

    maxTermFrequency
For this field in this document, what was the count of the most frequent term? For document 1 it's 3 ("the" occurs 3 times); for document 2 it's 1.

In 3.x, if you want to consume these indexing-time statistics, you'll have to save them away yourself (e.g., somehow encoding them into the single-byte norm value). However, since 4.0 uses doc values for norms, you have more freedom to encode these statistics however you'd like. Your custom similarity can then pull from these.
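
For example, here's a sketch of what a custom computeNorm might do on 4.0 trunk, where it receives a Norm object to fill (you'd typically start from a copy of an existing Similarity, as discussed in the comments below):

    @Override
    public void computeNorm(FieldInvertState state, Norm norm) {
      int length  = state.getLength();            // total tokens in this field
      int unique  = state.getUniqueTermCount();   // distinct terms
      int maxFreq = state.getMaxTermFrequency();  // count of the most frequent term
      // store whatever your scoring model needs; an int costs 4X the RAM of a byte
      norm.setInt(length);
    }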

From these available statistics you're now free to derive other commonly used statistics:
  • Average field length across all documents is Terms.getSumTotalTermFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).

  • Average within-document field term frequency is FieldInvertState.length divided by FieldInvertState.uniqueTermCount.

  • Average number of unique terms per field across all documents is Terms.getSumDocFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).
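
In code, per segment, that looks something like this (a sketch, reusing the per-segment Terms from the earlier example):

    Terms terms = leaf.reader().terms("title");
    // average field length in tokens, across documents that have the field
    double avgLength = (double) terms.getSumTotalTermFreq() / terms.getDocCount();
    // average number of unique terms per document
    double avgUnique = (double) terms.getSumDocFreq() / terms.getDocCount();
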
Remember that the statistics do not reflect deleted documents until those documents are merged away; in general this also means that segment merging will alter scores! Similarly, if the field omits term frequencies, the statistics will not be correct (though they will still be consistent with one another: we will pretend each term occurred once per document).

17 comments:

  1. This is super-awesome! Thanks Mike!

  2. How can I get the count of a term for a given document and field?

  3. Hi Anonymous,

    You should pull a DocsEnum for that field + term, advance to the docID and call .freq() to get that count. This is also available in term vectors (if you want to get the count for all terms within a single doc).
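
    Roughly, against the 4.0 APIs (a sketch; the liveDocs argument skips deleted documents):

      DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader,
          MultiFields.getLiveDocs(reader), "title", new BytesRef("lion"));
      if (docsEnum != null && docsEnum.advance(docID) == docID) {
        int freq = docsEnum.freq();   // occurrences of "lion" in that document
      }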

  4. Hi
    I'm trying to figure out, in Lucene 4, how to recover the document length (the total number of keywords it contains) from the norm vector at query run-time. For some (unclear) reason the encode method for the norm values is not public.
    Any way of doing it?

  5. Hi David,

    Which similarity are you using? Just the default (DefaultSimilarity)? If so, then you can get the docLength from the norm by calling DefaultSimilarity.decodeNormValue. Note that if you boosted the field during indexing it will change this document length (higher boost makes doc length smaller). Also note that this is heavily quantized: we use only a single byte to encode the length by default.
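
    For example (a sketch; assumes you've already pulled the raw norm byte for that doc + field):

      // decodeNormValue returns boost/sqrt(numTerms), so with no index-time boost:
      float decoded = new DefaultSimilarity().decodeNormValue(normByte);
      float approxDocLength = 1f / (decoded * decoded);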

    If the quantization or mix-in of boost is a problem you can also make your own Similarity impl and store docLen yourself (e.g. as an int).

    1. Hi Mike
      Thanks for the quick response.

      I'm experimenting with BM25 and the new language-model-based similarities in Lucene 4. It turns out that the method decodeNormValue() is public for TFIDFSimilarity but protected for the rest. Of course I can extend these similarity classes to expose this method, so my question is whether there is a special reason why it should not be public by default.

      In general, I really like the new flexibility of the similarity measures provided by Lucene 4. I would, however, wish for a simpler and more intuitive API to extract the (exact) doc length from the index. I think these values should be provided by the API and not handled by the application (just wishful thinking).

      thanks, and all the best
      David

  6. Hi David,

    I think it's actually an artifact from back when TFIDFSim was the only Similarity in Lucene, that its decodeNormValue is still public... really, how the sim stores its stats (and what stats it stores) are private implementation details to it.

    If you really want the int per field X doc ... you could clone BM25Sim, but then instead of norm.setByte(...) in computeNorm, use norm.setInt (and then remove all the encode/decodeNormValue stuff). Then in the Exact/Sloppy scorers, pull the int[] docLengths instead of byte[] norms. The downside of course is 4X the RAM per field X doc.

  7. Hi Mike,
    How can I get the freq of a term in a document?
    Thanks in advance.

  8. Hi hossein,

    You need to first get a TermsEnum (Terms.iterator(null)), seek to that term (termsEnum.seekExact: verify you got back "true", indicating the term exists), then pull a DocsEnum from it (termsEnum.docs(...)) and use its .advance method to skip to that docID (verify the returned docID is the one you advanced to), then call freq().
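
    Something like this (a sketch on a per-segment (atomic) reader; note that on current trunk seekExact also takes a useCache flag):

      Terms terms = atomicReader.terms("title");
      TermsEnum termsEnum = terms.iterator(null);
      if (termsEnum.seekExact(new BytesRef("lion"), false)) {
        DocsEnum docsEnum = termsEnum.docs(atomicReader.getLiveDocs(), null);
        if (docsEnum.advance(docID) == docID) {
          int freq = docsEnum.freq();
        }
      }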

  9. Hi Mike,

    How could I get the number of the unique terms in a document in my Similarity impl?

    Thanks in advance.

  10. Anonymous,

    That stat is available at indexing time, passed to your Similarity.computeNorm in the FieldInvertState argument, member uniqueTermCount.

    If you need this at search time you'll have to save it away somewhere ...

  11. Hey Mike

    I'm trying to understand what you explained but I'm pretty confused. I tried to create the TermsEnum from the IndexReader but it always gives me a null exception.

    1. Hi Bintang,

      Can you send questions to the user's list (java-user@lucene.apache.org)? Include the code fragment you're currently using and the exception you're hitting ...

  12. Hi Mike,
    I'd like to know if there is an Analyzer that (at indexing time) doesn't just tokenize on whitespace, but lets me treat multi-word phrases as single terms, so that the API described in this post keeps working as expected. In my use case I have a set of phrases for which I need TermsEnum.docFreq(). Is something like that possible? If there isn't anything like that, would it be trivial to implement a custom analyzer? Many thanks

    1. Hi Arnaldo,

      Can you ask this question on the Lucene user's list (java-user@lucene.apache.org)?

  13. Hello, I calculate the average term frequency for each document in Similarity.computeNorm and encode it into my norm. However, I also need the corresponding mean value over all documents.

    Does anybody have a suggestion how I could achieve this?

    1. Can you load your norms at search time and sum them up and compute the mean? You'd just have to do it on each searcher init.
