PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones), will be available starting in the upcoming 4.1 release.
Highlighting is crucial functionality in most search applications since it's the first step of the hard-to-solve final inch problem, i.e. getting the user not only to the best matching documents but also to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content depends on its MIME type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).
Google's Chrome browser has an ingenious solution to the final inch problem when you use "Find..." to search the current web page: it highlights the vertical scroll bar, showing you where the matches are on the page. You can then scroll to those locations, or click on the highlights in the scroll bar to jump there. Wonderful!
All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets, which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn't store the full token graph.
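For example, here's a minimal sketch of pulling those per-token offsets out of an analyzer via OffsetAttribute (assuming Lucene 4.1-era APIs; the StandardAnalyzer, field name and sample text are just for illustration):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class ShowOffsets {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
    TokenStream ts = analyzer.tokenStream("body",
        new StringReader("Lucene highlights the final inch"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // Each token reports where it started and ended in the original text:
      System.out.println(term + " -> [" + offsets.startOffset()
          + "," + offsets.endOffset() + ")");
    }
    ts.end();
    ts.close();
  }
}
```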
Unlike the existing highlighters, which rely on term vectors or on re-analysis of each matched document to obtain the per-token offsets, PostingsHighlighter uses the recently added postings offsets feature. To index postings offsets you must set the field to be highlighted to use the FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS option during indexing.
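Concretely, indexing a field with offsets in the postings might look like this (a sketch assuming Lucene 4.1 APIs; the RAMDirectory, analyzer, field name and content are just for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexWithOffsets {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41,
        new StandardAnalyzer(Version.LUCENE_41));
    IndexWriter writer = new IndexWriter(dir, iwc);

    // Record offsets in the postings so PostingsHighlighter can use them at search time:
    FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
    offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

    Document doc = new Document();
    doc.add(new Field("body",
        "Highlighting is crucial functionality in most search applications.",
        offsetsType));
    writer.addDocument(doc);
    writer.close();
    dir.close();
  }
}
```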
It turns out the postings are a much more efficient place to store offsets, because the default codec (currently Lucene41) does a good job compressing them: ~1.1 bytes per position, which includes both start and end offset. In contrast, term vectors require substantially more disk space (~7.8X for the 10 million document English Wikipedia index), slow down indexing and merging, and are slow to access at search time. A smaller index also means the "working set" size, i.e. the net number of bytes that your search application frequently hits from disk, will be smaller, so you'll need less RAM to keep the index hot.
PostingsHighlighter uses a BreakIterator to find passages in the text; by default it breaks using getSentenceInstance. It then iterates in parallel (merge sorting by offset) through the positions of all terms from the query, coalescing those hits that occur in a single passage into a Passage, and then scores each Passage using a separate PassageScorer.
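At search time, using it looks roughly like this (a sketch assuming the 4.1 sandbox PostingsHighlighter API and an index whose "body" field was indexed with offsets, as in the earlier snippet; the query term is made up):

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;
import org.apache.lucene.store.Directory;

public class HighlightSearch {
  static void showTopHighlights(Directory dir) throws Exception {
    IndexReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    Query query = new TermQuery(new Term("body", "highlighting"));
    TopDocs topDocs = searcher.search(query, 10);

    // Sentence-level passages and BM25-based passage scoring by default:
    PostingsHighlighter highlighter = new PostingsHighlighter();
    String[] snippets = highlighter.highlight("body", query, searcher, topDocs);
    for (String snippet : snippets) {
      System.out.println(snippet);
    }
    reader.close();
  }
}
```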
The scoring model is fun: it treats the single original document as the whole corpus, and then scores individual passages as if they were documents in this corpus. The default PassageScorer uses BM25 scoring, biased with a normalization factor that favors passages occurring closer to the start of the document, but it's pluggable so you can implement your own scoring (and feel free to share if you find an improvement!).
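To make the idea concrete, here is a toy scoring function. This is not Lucene's actual PassageScorer; it's only an illustration of BM25-style scoring over passages-as-documents with a bias toward passages near the start of the document, and all of the constants and the exact bias are made up:

```java
/**
 * Illustration only: a BM25-flavored passage score, treating the passages of
 * one document as a tiny corpus, with a bias toward early passages.
 * This is NOT the code of Lucene's PassageScorer; the real formula differs.
 */
public class PassageScoreSketch {
  static final float K1 = 1.2f, B = 0.75f;

  /**
   * @param termFreqInPassage  occurrences of a query term in this passage
   * @param passageFreqOfTerm  number of passages containing the term
   * @param numPassages        total passages in the document (the "corpus")
   * @param passageLen         length of this passage in characters
   * @param avgPassageLen      average passage length in characters
   * @param passageStartOffset character offset where this passage begins
   */
  static float score(int termFreqInPassage, int passageFreqOfTerm, int numPassages,
                     float passageLen, float avgPassageLen, int passageStartOffset) {
    // BM25-style "idf" computed over passages instead of documents:
    float idf = (float) Math.log(
        1 + (numPassages - passageFreqOfTerm + 0.5) / (passageFreqOfTerm + 0.5));
    // BM25-style term-frequency saturation with passage-length normalization:
    float tf = (termFreqInPassage * (K1 + 1))
        / (termFreqInPassage + K1 * (1 - B + B * passageLen / avgPassageLen));
    // Bias: passages that start earlier in the document score a bit higher.
    float startBias = 1.0f / (float) Math.log(2 + passageStartOffset / 1000.0);
    return idf * tf * startBias;
  }
}
```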
This new highlighter should be substantially faster than our existing highlighters on a cold index (one that doesn't fit entirely into available RAM), since it does more sequential IO and less seek-heavy random access. The gains should grow as you increase the number of top hits, and the larger your documents are, the bigger the gains should be.
One known limitation is that it can only highlight a single field at a time, i.e. you cannot pass it N fields and have it pick the best passages across all of them, though both existing highlighters have the same limitation. The code is very new and may still have some exciting bugs! This is why it's located under Lucene's sandbox module.
If you are serious about highlighting in your search application (and you should be!) then PostingsHighlighter is well worth a look!