PostingsHighlighter is our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.
Highlighting is crucial functionality in most search applications since it's the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but also to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content depends on its MIME type (e.g., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).
Google's Chrome browser has an ingenious solution to the final inch problem, when you use "Find..." to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!
All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets, which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn't store the full token graph.
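To make the idea of per-token offsets concrete, here is a minimal JDK-only sketch. It is not Lucene's analysis chain: the `OffsetSketch` class, its `Token` type and the letters-and-digits tokenizer are all hypothetical stand-ins, meant only to show how a start/end character offset pair per token lets you wrap matches in the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; Lucene analyzers expose offsets via OffsetAttribute instead.
final class OffsetSketch {
    /** A term plus the character offsets where it starts (inclusive) and ends (exclusive). */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on anything that is not a letter or digit, lowercases, and records offsets. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    /** Wraps every occurrence of queryTerm in <b>...</b>, using offsets into the original text. */
    static String highlight(String text, String queryTerm) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (Token t : tokenize(text)) {
            if (t.term.equals(queryTerm)) {
                sb.append(text, last, t.startOffset);
                sb.append("<b>").append(text, t.startOffset, t.endOffset).append("</b>");
                last = t.endOffset;
            }
        }
        sb.append(text, last, text.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene is fast. Try Lucene!", "lucene"));
    }
}
```

Because the offsets point into the original content rather than the analyzed terms, the highlighted text keeps its original casing and punctuation. It also shows why wrong offsets from a misbehaving filter produce wrong highlights: the `<b>` tags simply land wherever the offsets say.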
Unlike the existing highlighters, which rely on term vectors or on re-analysis of each matched document to obtain the per-token offsets, PostingsHighlighter uses the recently added postings offsets feature. To index postings offsets you must set the field to be highlighted to use the FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS option during indexing.
It turns out postings offsets are a much more efficient way to store offsets because the default codec (currently Lucene41) does a good job compressing them: ~1.1 bytes per position, which includes both start and end offset. In contrast, term vectors require substantially more disk space (~7.8X for the 10 million document English Wikipedia index), slow down indexing and merging, and are slow to access at search time. A smaller index also means the "working set" size, i.e. the net number of bytes that your search application frequently hits from disk, will be smaller, so you'll need less RAM to keep the index hot.
PostingsHighlighter uses a BreakIterator to find passages in the text; by default it breaks using getSentenceInstance. It then iterates in parallel (merge sorting by offset) through the positions of all terms from the query, coalescing those hits that occur in a single passage into a Passage, and then scores each Passage using a separate PassageScorer.
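The passage-finding step can be sketched with nothing but the JDK. This is an illustration under stated assumptions, not the highlighter's actual code: the `PassageSketch` class, its `Passage` type and the method names are hypothetical, and the per-term offsets are passed in as plain arrays where the real thing would read them from postings.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.PriorityQueue;

// Hypothetical illustration only; names and structure are not Lucene's.
final class PassageSketch {
    /** One sentence-sized passage and the start offsets of the query hits inside it. */
    static final class Passage {
        final int start;
        final int end;
        final List<Integer> hitOffsets = new ArrayList<>();

        Passage(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Merge-sorts the (already sorted) offset lists of all query terms into one sorted array. */
    static int[] mergeOffsets(int[][] perTerm) {
        // Queue entries are {offset, termIndex, indexWithinTerm}, ordered by offset.
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        int total = 0;
        for (int t = 0; t < perTerm.length; t++) {
            total += perTerm[t].length;
            if (perTerm[t].length > 0) {
                pq.add(new int[] {perTerm[t][0], t, 0});
            }
        }
        int[] merged = new int[total];
        int n = 0;
        while (!pq.isEmpty()) {
            int[] head = pq.poll();
            merged[n++] = head[0];
            int next = head[2] + 1;
            if (next < perTerm[head[1]].length) {
                pq.add(new int[] {perTerm[head[1]][next], head[1], next});
            }
        }
        return merged;
    }

    /** Walks sentences and sorted hits together, keeping only sentences that contain a hit. */
    static List<Passage> findPassages(String text, int[] sortedHits) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);
        List<Passage> passages = new ArrayList<>();
        int h = 0;
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE && h < sortedHits.length;
             start = end, end = bi.next()) {
            Passage p = null;
            while (h < sortedHits.length && sortedHits[h] < end) {
                if (p == null) {
                    p = new Passage(start, end);
                }
                p.hitOffsets.add(sortedHits[h++]);
            }
            if (p != null) {
                passages.add(p);
            }
        }
        return passages;
    }

    public static void main(String[] args) {
        String text = "Lucene is fast. Highlighting is hard. Lucene makes highlighting fast.";
        int[] hits = mergeOffsets(new int[][] {{0, 38}, {10, 64}}); // "lucene" and "fast"
        for (Passage p : findPassages(text, hits)) {
            System.out.println(text.substring(p.start, p.end).trim() + " " + p.hitOffsets);
        }
    }
}
```

Since both the sentence boundaries and the merged hits arrive in increasing offset order, a single forward pass over each is enough, which is part of why this style of highlighting is friendly to sequential reads.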
The scoring model is fun: it treats the single original document as the whole corpus, and then scores individual passages as if they were documents in this corpus. The default PassageScorer uses BM25 scoring, biased with a normalization factor that favors passages occurring closer to the start of the document, but it's pluggable so you can implement your own scoring (and feel free to share if you find an improvement!).
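Here is a rough sketch of what "passages as documents" scoring could look like. It uses the textbook BM25 formula with passages playing the role of documents; the position bias function, the constants and every name in `PassageScoreSketch` are assumptions made up for illustration and are not the formulas the default PassageScorer actually uses.

```java
// Hypothetical illustration only; not the default PassageScorer's formulas.
final class PassageScoreSketch {
    static final double K1 = 1.2;  // common BM25 default, assumed here
    static final double B = 0.75;  // common BM25 default, assumed here

    /** BM25-style idf where the "corpus" is the set of passages in one document. */
    static double idf(int numPassages, int passagesWithTerm) {
        return Math.log(1.0 + (numPassages - passagesWithTerm + 0.5) / (passagesWithTerm + 0.5));
    }

    /** BM25 term-frequency saturation with length normalization by passage length. */
    static double tfNorm(int freq, int passageLen, double avgPassageLen) {
        return freq * (K1 + 1) / (freq + K1 * (1 - B + B * passageLen / avgPassageLen));
    }

    /** Assumed position bias: 1.0 at the start of the document, slowly decaying after that. */
    static double positionNorm(int passageStart) {
        return 1.0 / (1.0 + Math.log1p(passageStart / 100.0));
    }

    /**
     * Scores one passage. termFreqs[t] is how often query term t occurs in this passage;
     * passagesWithTerm[t] is how many passages of the document contain term t.
     */
    static double score(int[] termFreqs, int[] passagesWithTerm, int numPassages,
                        int passageStart, int passageLen, double avgPassageLen) {
        double sum = 0.0;
        for (int t = 0; t < termFreqs.length; t++) {
            if (termFreqs[t] > 0) {
                sum += idf(numPassages, passagesWithTerm[t])
                        * tfNorm(termFreqs[t], passageLen, avgPassageLen);
            }
        }
        return sum * positionNorm(passageStart);
    }

    public static void main(String[] args) {
        int[] passagesWithTerm = {2, 5};
        double early = score(new int[] {1, 1}, passagesWithTerm, 20, 0, 80, 80.0);
        double late = score(new int[] {1, 1}, passagesWithTerm, 20, 5000, 80, 80.0);
        System.out.println("early=" + early + " late=" + late);
    }
}
```

With identical term counts and lengths, the passage nearer the start of the document gets the higher score, which mirrors the bias described above. A custom scorer would swap in different functions for any of the three pieces.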
This new highlighter should be substantially faster than our existing highlighters on a cold index (when the index doesn't fit entirely into available RAM), as it does more sequential IO instead of seek-heavy random access. Furthermore, as you increase the number of top hits, the performance gains should be even better. Also, the larger the documents the better the performance gains should be.
One known limitation is that it can only highlight a single field at a time, i.e. you cannot pass it N fields and have it pick the best passages across all of them, though both existing highlighters have the same limitation. The code is very new and may still have some exciting bugs! This is why it's located under Lucene's sandbox module.
If you are serious about highlighting in your search application (and you should be!) then PostingsHighlighter is well worth a look!