Changing Bits: A new Lucene highlighter is born

Sunday, December 30, 2012

A new Lucene highlighter is born

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it's the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google's Chrome browser has an ingenious solution to the final inch problem, when you use "Find..." to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn't store the full token graph.

Unlike the existing highlighters, which rely on term-vectors or on re-analysis of each matched document to obtain the per-token offsets, PostingsHighlighter uses the recently added postings offsets feature. To index postings offsets you must set the field to be highlighted to use FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS option during indexing.

It turns out postings offsets is much more efficient storage for offsets because the default codec (currently Lucene41) does a good job compressing them: ~1.1 byte per position, which includes both start and end offset. In contrast, term vectors require substantially more disk space (~7.8X for the 10 million document English Wikipedia index), slow down indexing and merging, and are slow to access at search time. A smaller index also means the "working set" size, i.e. the net number of bytes that your search application frequently hits from disk, will be smaller, so you'll need less RAM to keep the index hot.

PostingsHighlighter uses a BreakIterator to find passages in the text; by default it breaks using getSentenceIterator. It then iterates in parallel (merge sorting by offset) through the positions of all terms from the query, coalescing those hits that occur in a single passage into a Passage, and then scores each Passage using a separate PassageScorer.

The scoring model is fun: it treats the single original document as the whole corpus, and then scores individual passages as if they were documents in this corpus. The default PassageScorer uses BM25 scoring, biased with a normalization factor that favors passages occurring closer to the start of the document, but it's pluggable so you can implement your own scoring (and feel free to share if you find an improvement!).

This new highlighter should be substantially faster than our existing highlighters on a cold index (when the index doesn't fit entirely into available RAM), as it does more sequential IO instead of seek-heavy random access. Furthermore, as you increase the number of top hits, the performance gains should be even better. Also, the larger the documents the better the performance gains should be.

One known limitation is that it can only highlight a single field at a time, i.e. you cannot pass it N fields and have it pick the best passages across all of them, though both existing highlighters have the same limitation. The code is very new and may still have some exciting bugs! This is why it's located under Lucene's sandbox module.

If you are serious about highlighting in your search application (and you should be!) then PostingsHighlighter is well worth a look!

34 comments:

AnonymousJanuary 1, 2013 at 9:21 AM
Sounds very promising! I am curious though, how does this highlighter know sentences boundaries? I thought Lucene by default looses this info.
ReplyDelete
Replies
Michael McCandlessJanuary 1, 2013 at 7:13 PM
Hi Anonymous,

The new highlighter uses a BreakIterator, so you are free to pick whatever fragments you want. If you don't specify one it defaults to BreakIterator.getSentenceInstance(Locale.ROOT).
ReplyDelete
Replies
JuampaJanuary 8, 2013 at 4:51 AM
Sounds interesting. Can we use the SentenceDetector from OpenNLP project to pick the fragments ?
ReplyDelete
Replies
Michael McCandlessJanuary 8, 2013 at 8:27 AM
Hi Juampa,

Does OpenNLP's SentenceDetector implement BreakIterator? If so, then it's easy (just pass it when you create the PostingsHighlighter).

If not, you'd have to make your own BreakIterator that wraps SentenceDetector and bridges the two APIs.
ReplyDelete
Replies
Michael SokolovJanuary 24, 2013 at 9:35 AM
I'm super excited to see this come to fruition - thanks to Robert. We'll be giving it a whirl as soon as we can work it into our schedule.
ReplyDelete
Replies
Adrian PemselJanuary 31, 2013 at 10:08 AM
That's great work and I enjoy playing around with it so far. How would you solve the second step of the final inch problem? At the moment I fear I have to use a second index (where each page is a document) but that seams to be overkill - is there no way to get the offset into the text for each highlighting?
ReplyDelete
Replies
Michael McCandlessJanuary 31, 2013 at 2:08 PM
Hi Adrian,

You'd need 1) the reverse mapping from character offsets back to "locations" (ideally exact spot) in the rendered document, and 2) to work with whatever is used to render the document to the user (the browser for HTML, Acrobat reader for PDF, Word for word docs, etc.) to "jump to that spot and highlight it".

For 1) you'd need document filters that preserve "location in the original rendered document" somehow, and then you could eg populate a custom attribute during analysis, and re-analyzing the document (this isn't so bad because you're analyzing just the 1 doc that the user just clicked on), and then pull tokens until you're at the right offset (that was highlighted).

Eg, if your "location" is just page number, then you could have a PageNumberAttribute that increments whenever you cross to another page, and eg for PDF it's possible to jump to a particular page (http://acrobatusers.com/tutorials/can-i-hyperlink-specific-page-pdf-file ) for the second part.

But it's definitely tricky!
ReplyDelete
Replies
Adrian PemselFebruary 1, 2013 at 2:56 AM
Thank you so much. Just getting the character offset in the Solr response would be enough for me since I can get the character offset -> location mapping from my existing system for free. We are (mainly) indexing PDFs and the system I developed translates that to the text that is sent to Solr and a HTML version for display - essentially a PDF renderer outputing HTML tags... THAT was tricky ;-)
ReplyDelete
Replies
Michael McCandlessFebruary 2, 2013 at 8:44 AM
Hi Adrian,

Maybe you could share your cool system for mapping from char offset -> location :)

I'm not sure how to get the offsets from Solr, but hopefully this is possible! The underlying Lucene index clearly has it ...

Maybe email solr-user@lucene.apache.org?
ReplyDelete
Replies
Adrian PemselFebruary 4, 2013 at 5:25 AM
Thank you again Mike, I am going to ask on solr-user. In the meantime I am trying a crutch by just passing the whole snippet to my frontend code and searching for the text snippet myself...

Unfortunately I am probably unable to share much of the work on the PDF system as it is using a proprietary library for parsing the PDFs (PDFLibTet). The PDF renderer is actually written in PHP (...), you can see an example of the search and display by going to
http://www.jusmeum.de/suche?search%5Bquery%5D=verbraucher+widerrufsrecht+nachricht
and clicking one of the results (these are German legal judgments ;-) That is still using the regexp fragmenter for highlighting, but I am probably going to switch to the PostingsHighlighter since all my tests so far have been positive.
ReplyDelete
Replies
Michael McCandlessFebruary 4, 2013 at 6:44 AM
Wow, that UI is nice! How do you tokenize/decompound? (This is challenging in German!). I had Chrome translate to English and it properly translated the highlighted parts too :)

I wonder if the open-source PDF packages (eg Apache PDFBox) could be swapped in?

That's a good idea on the snippet ...

Thanks for sharing your PostingsHighlighter experience, and this is good news!
ReplyDelete
Replies
UnknownMarch 1, 2013 at 6:24 PM
I just discovered https://issues.apache.org/jira/browse/LUCENE-4641 which just rained pretty hard on my parade. If I can't have WordDelimiterFilterFactory I can't move forward ... which kills me because the current highlighter has been the cause of most of my search woes over the years.
ReplyDelete
Replies
Michael McCandlessMarch 3, 2013 at 3:24 PM
Unfortunately proper highlighting relies on the Analyzer producing correct offsets ... if the Analyzer is buggy then the highlights will be off, regardless of which highlighter impl you use ...
ReplyDelete
Replies
RonaldMarch 23, 2013 at 12:05 PM
Hey Mike, any sample code for implementation of Highlighter ?

Regards,
Ronald
ReplyDelete
Replies
Michael McCandlessMarch 25, 2013 at 3:15 PM
Hi Ronald,

Maybe have a look at the unit test? TestPostingsHighlighter.java
ReplyDelete
Replies
AnonymousMay 7, 2013 at 1:46 PM
Mike

Does the new highlighter work with SpanQuery?
ReplyDelete
Replies
Michael McCandlessMay 7, 2013 at 3:54 PM
Anonymous,

It should "work" in that no exception will be generated, but, this highlighter makes no guarantee that the snippets it shows you actually "match" the query. Same with other positional queries e.g. PhraseQuery. But in practice playing with it I'm not sure it often matters ... because snippets with many and diverse term matches score higher and so the ones that are shown are usually good matches.

Try it and see!
ReplyDelete
Replies
RonaldMay 20, 2013 at 8:18 AM
Hello Mike, i have gone through the code, am bit skeptic about the implementation, since L4, is new and now tutorials available yet, confused how to get the snippets /term positions from the index, i knew that is similar to 3.x, in L4 there is a big change, need to understand how to get the snippets and display the search results using highlighter...

Regards,
Ronald
ReplyDelete
Replies
Michael McCandlessMay 20, 2013 at 12:41 PM
Hi Ronald,

Probably the simplest way to see how to use PostingsHighlighter is to look at its unit tests: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/lucene/highlighter/src/test/org/apache/lucene/search/postingshighlight/
ReplyDelete
Replies
AlexeyMay 21, 2013 at 10:12 AM
Hi Michael,
Does this new highlighter support wildcard and fuzzy queries? From my tests, it's not capable of highlighting such matches.

Regards,
Alexey
ReplyDelete
Replies
Michael McCandlessMay 21, 2013 at 11:26 AM
Hi Alexey,

MultiTermQueries are tricky: PostingsHighlighter intentionally does nothing with them because it can be a performance trap.

One simple thing you can do is rewrite the query yourself up-front:

query = searcher.rewrite(query);

And then search and highlight with that query. The problem is, when a MTQ matches enough terms, it will rewrite to a filter and I believe no terms will be highlighted. You can change this by setting the rewrite method for the query, but ... this gets costly because the more term the highlighter (and BooleanQuery) must visit, the more CPU/IO is spent.

Honestly, when an MTQ matches many terms, I don't think highlighting is really so useful, so perhaps it's good that the filter won't highlight anything.
ReplyDelete
Replies
Alwyn SchoemanJune 26, 2013 at 2:40 PM
Hi Michael,

I'm working on a utility that will use a Lucene index to search files (mostly java and sql) instead of me scripting find/grep.

When using PostingsHighlighter the BreakIterator options doesn't currently allow me to get a whole line for a term, start till end of line.

Can you give me some advice on the best place to start implementing something like that? Eventually I'd even like to tell it to give me X lines before and/or after the match.
ReplyDelete
Replies
AnonymousAugust 16, 2013 at 8:55 AM
Hello Mike,
What data structure is internally used for this Highlighting purpose, specially the new one PostingsHighlighter?
ReplyDelete
Replies
SreeeSeptember 3, 2013 at 2:48 AM
Hi Mike,

Are there any ways out to control the length of the text appearing before and after the highlighted text?

Thanks & Regards,
Sreehareesh
ReplyDelete
Replies
BelayNovember 27, 2013 at 2:58 AM
Hello dears,
My question is not relevant to the topic under discussion but I hope I will get some help. I tried to use the Utils class by importing "org.apache.poi.hdf.extractor.Utils" as follows:
Utils util = new Utils();
util.displayTokens(new myAnalyzer(), searchWord);
However, the class is not in the lucene I am using (lucene 1.4.3) and I tried to search for a separate jar but could not get it. Pls forward your help.
ReplyDelete
Replies

Add comment