PostingsHighlighter is our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.
Highlighting is crucial functionality in most search applications since it's the first step of the hard-to-solve final inch problem, i.e. getting the user not only to the best matching documents but to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content depends on its mime-type (the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).
Google's Chrome browser has an ingenious solution to the final inch problem, when you use "Find..." to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or click on the highlights in the scroll bar to jump there. Wonderful!
All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets, which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn't store the full token graph.
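You can see the offsets your analyzer produces by iterating a TokenStream yourself; here is a minimal sketch (StandardAnalyzer is just a stand-in for whatever analyzer your application uses):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
TokenStream ts = analyzer.tokenStream("body", new StringReader("The quick brown fox"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
ts.reset();
while (ts.incrementToken()) {
  // Each token records where it came from in the original text:
  System.out.println(term + " start=" + offset.startOffset() + " end=" + offset.endOffset());
}
ts.end();
ts.close();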
Unlike the existing highlighters, which rely on term vectors or on re-analysis of each matched document to obtain the per-token offsets, PostingsHighlighter uses the recently added postings offsets feature. To index postings offsets you must set the field to be highlighted to use the FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS option during indexing.
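For example (a minimal sketch against the 4.1 APIs; the "body" field name is just a placeholder, and text and writer are assumed to exist):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo.IndexOptions;

// Start from TextField's defaults, then also index offsets into the postings:
FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

Document doc = new Document();
doc.add(new Field("body", text, offsetsType));
writer.addDocument(doc);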
It turns out postings offsets are a much more efficient way to store offsets because the default codec (currently Lucene41) does a good job compressing them: ~1.1 bytes per position, which includes both start and end offset. In contrast, term vectors require substantially more disk space (~7.8X for the 10 million document English Wikipedia index), slow down indexing and merging, and are slow to access at search time. A smaller index also means the "working set" size, i.e. the net number of bytes that your search application frequently hits from disk, will be smaller, so you'll need less RAM to keep the index hot.
PostingsHighlighter uses a BreakIterator to find passages in the text; by default it breaks using getSentenceInstance. It then iterates in parallel (merge sorting by offset) through the positions of all terms from the query, coalescing those hits that occur in a single passage into a Passage, and then scores each Passage using a separate PassageScorer.
The scoring model is fun: it treats the single original document as the whole corpus, and then scores individual passages as if they were documents in this corpus. The default PassageScorer uses BM25 scoring, biased with a normalization factor that favors passages occurring closer to the start of the document, but it's pluggable so you can implement your own scoring (and feel free to share if you find an improvement!).
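Putting the pieces together, basic usage looks roughly like this (a minimal sketch against the 4.1 APIs; searcher and query are assumed to exist, and the "body" field must have been indexed with offsets as shown above):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;

PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10);
// One (possibly multi-passage) snippet per hit in topDocs:
String[] snippets = highlighter.highlight("body", query, searcher, topDocs);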
This new highlighter should be substantially faster than our existing highlighters on a cold index (when the index doesn't fit entirely into available RAM), as it does more sequential IO instead of seek-heavy random access. Furthermore, as you increase the number of top hits, the performance gains should be even better. Also, the larger the documents, the better the performance gains should be.
One known limitation is that it can only highlight a single field at a time, i.e. you cannot pass it N fields and have it pick the best passages across all of them, though both existing highlighters have the same limitation. The code is very new and may still have some exciting bugs! This is why it's located under Lucene's sandbox module.
If you are serious about highlighting in your search application (and you should be!) then PostingsHighlighter is well worth a look!
Sounds very promising! I am curious though, how does this highlighter know sentence boundaries? I thought Lucene by default loses this info.
Hi Anonymous,
The new highlighter uses a BreakIterator, so you are free to pick whatever fragments you want. If you don't specify one it defaults to BreakIterator.getSentenceInstance(Locale.ROOT).
Sounds interesting. Can we use the SentenceDetector from the OpenNLP project to pick the fragments?
Hi Juampa,
Does OpenNLP's SentenceDetector implement BreakIterator? If so, then it's easy (just pass it when you create the PostingsHighlighter).
If not, you'd have to make your own BreakIterator that wraps SentenceDetector and bridges the two APIs.
I'm super excited to see this come to fruition - thanks to Robert. We'll be giving it a whirl as soon as we can work it into our schedule.
That's great work and I enjoy playing around with it so far. How would you solve the second step of the final inch problem? At the moment I fear I have to use a second index (where each page is a document) but that seems to be overkill - is there no way to get the offset into the text for each highlight?
ReplyDeleteHi Adrian,
You'd need 1) the reverse mapping from character offsets back to "locations" (ideally the exact spot) in the rendered document, and 2) to work with whatever is used to render the document to the user (the browser for HTML, Acrobat Reader for PDF, Word for Word docs, etc.) to "jump to that spot and highlight it".
For 1) you'd need document filters that preserve "location in the original rendered document" somehow; then you could e.g. populate a custom attribute during analysis, re-analyze the document (this isn't so bad because you're analyzing just the one doc that the user clicked on), and then pull tokens until you're at the right offset (the one that was highlighted).
E.g., if your "location" is just a page number, then you could have a PageNumberAttribute that increments whenever you cross to another page, and e.g. for PDF it's possible to jump to a particular page (http://acrobatusers.com/tutorials/can-i-hyperlink-specific-page-pdf-file ) for the second part.
But it's definitely tricky!
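A rough sketch of what such a hypothetical PageNumberAttribute could look like (these class names are made up to illustrate the idea; Lucene doesn't ship them):

import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;

// Hypothetical attribute carrying "which page did this token come from":
public interface PageNumberAttribute extends Attribute {
  int getPageNumber();
  void setPageNumber(int pageNumber);
}

public class PageNumberAttributeImpl extends AttributeImpl implements PageNumberAttribute {
  private int pageNumber = 1;
  public int getPageNumber() { return pageNumber; }
  public void setPageNumber(int pageNumber) { this.pageNumber = pageNumber; }
  @Override
  public void clear() { pageNumber = 1; }
  @Override
  public void copyTo(AttributeImpl target) {
    ((PageNumberAttribute) target).setPageNumber(pageNumber);
  }
}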
Thank you so much. Just getting the character offset in the Solr response would be enough for me since I can get the character offset -> location mapping from my existing system for free. We are (mainly) indexing PDFs, and the system I developed translates them to the text that is sent to Solr and an HTML version for display - essentially a PDF renderer outputting HTML tags... THAT was tricky ;-)
Hi Adrian,
Maybe you could share your cool system for mapping from char offset -> location :)
I'm not sure how to get the offsets from Solr, but hopefully this is possible! The underlying Lucene index clearly has it ...
Maybe email solr-user@lucene.apache.org?
Thank you again Mike, I am going to ask on solr-user. In the meantime I am trying a crutch by just passing the whole snippet to my frontend code and searching for the text snippet myself...
Unfortunately I am probably unable to share much of the work on the PDF system as it is using a proprietary library for parsing the PDFs (PDFLibTet). The PDF renderer is actually written in PHP (...), you can see an example of the search and display by going to
http://www.jusmeum.de/suche?search%5Bquery%5D=verbraucher+widerrufsrecht+nachricht
and clicking one of the results (these are German legal judgments ;-) That is still using the regexp fragmenter for highlighting, but I am probably going to switch to the PostingsHighlighter since all my tests so far have been positive.
Wow, that UI is nice! How do you tokenize/decompound? (This is challenging in German!). I had Chrome translate to English and it properly translated the highlighted parts too :)
I wonder if the open-source PDF packages (e.g. Apache PDFBox) could be swapped in?
That's a good idea on the snippet ...
Thanks for sharing your PostingsHighlighter experience, and this is good news!
I just discovered https://issues.apache.org/jira/browse/LUCENE-4641 which just rained pretty hard on my parade. If I can't have WordDelimiterFilterFactory I can't move forward ... which kills me because the current highlighter has been the cause of most of my search woes over the years.
Unfortunately proper highlighting relies on the Analyzer producing correct offsets ... if the Analyzer is buggy then the highlights will be off, regardless of which highlighter impl you use ...
ReplyDeleteHey Mike, any sample code for implementation of Highlighter ?
Regards,
Ronald
Hi Ronald,
Maybe have a look at the unit test, TestPostingsHighlighter.java?
Mike
Does the new highlighter work with SpanQuery?
Anonymous,
It should "work" in that no exception will be generated, but this highlighter makes no guarantee that the snippets it shows you actually "match" the query. Same with other positional queries, e.g. PhraseQuery. But in practice, playing with it, I'm not sure it often matters ... because snippets with many and diverse term matches score higher and so the ones that are shown are usually good matches.
Try it and see!
Hello Mike, I have gone through the code and am a bit skeptical about the implementation. Since Lucene 4 is new and no tutorials are available yet, I'm confused about how to get the snippets/term positions from the index. I knew how to do this in 3.x, but in Lucene 4 there is a big change; I need to understand how to get the snippets and display the search results using the highlighter...
Regards,
Ronald
Hi Ronald,
Probably the simplest way to see how to use PostingsHighlighter is to look at its unit tests: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/lucene/highlighter/src/test/org/apache/lucene/search/postingshighlight/
Hi Michael,
Does this new highlighter support wildcard and fuzzy queries? From my tests, it's not capable of highlighting such matches.
Regards,
Alexey
Hi Alexey,
MultiTermQueries are tricky: PostingsHighlighter intentionally does nothing with them because it can be a performance trap.
One simple thing you can do is rewrite the query yourself up-front:
query = searcher.rewrite(query);
And then search and highlight with that query. The problem is, when an MTQ matches enough terms, it will rewrite to a filter and I believe no terms will be highlighted. You can change this by setting the rewrite method for the query, but ... this gets costly because the more terms the highlighter (and BooleanQuery) must visit, the more CPU/IO is spent.
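For example (a hypothetical sketch; the wildcard query and field name are placeholders, and searcher is assumed to exist):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

// Force the MTQ to rewrite to a BooleanQuery of real terms so they can be highlighted:
WildcardQuery wq = new WildcardQuery(new Term("body", "high*"));
wq.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
Query rewritten = searcher.rewrite(wq);
// ... then search and highlight with "rewritten"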
Honestly, when an MTQ matches many terms, I don't think highlighting is really so useful, so perhaps it's good that the filter won't highlight anything.
Hi Michael,
I'm working on a utility that will use a Lucene index to search files (mostly java and sql) instead of me scripting find/grep.
When using PostingsHighlighter, the BreakIterator options don't currently allow me to get a whole line for a term, from start to end of the line.
Can you give me some advice on the best place to start implementing something like that? Eventually I'd even like to tell it to give me X lines before and/or after the match.
Hi Alwyn,
It sounds like you'll need to make a custom BreakIterator that breaks by newline (if such a thing doesn't already exist somewhere!). But I think you should email the Lucene user's list (java-user@lucene.apache.org) to see if there are other ideas?
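A rough starting point might look like this (an untested sketch; "LineBreakIterator" is a made-up name, and it assumes the whole field value fits in a String):

import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class LineBreakIterator extends BreakIterator {
  private String text = "";
  private int current = 0;

  @Override public int first() { return current = 0; }
  @Override public int last() { return current = text.length(); }

  // Next boundary: just after the next newline, or end of text.
  @Override public int next() {
    if (current >= text.length()) return DONE;
    int nl = text.indexOf('\n', current);
    return current = (nl == -1) ? text.length() : nl + 1;
  }

  // Previous boundary: just after the newline before the current one, or 0.
  @Override public int previous() {
    if (current == 0) return DONE;
    int nl = text.lastIndexOf('\n', current - 2);
    return current = (nl == -1) ? 0 : nl + 1;
  }

  @Override public int next(int n) {
    int result = current;
    while (n-- > 0 && result != DONE) result = next();
    return result;
  }

  @Override public int following(int offset) {
    current = offset;
    return next();
  }

  @Override public int current() { return current; }

  @Override public CharacterIterator getText() {
    return new StringCharacterIterator(text);
  }

  @Override public void setText(CharacterIterator it) {
    StringBuilder sb = new StringBuilder();
    for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
      sb.append(c);
    }
    text = sb.toString();
    current = 0;
  }
}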
Hello Mike,
What data structure is internally used for highlighting, especially in the new PostingsHighlighter?
Hi Anonymous,
Better to ask questions like this on dev@lucene.apache.org...
PostingsHighlighter uses a priority queue to visit all matches for a given document in offsets order.
Hi Mike,
Are there any ways to control the length of the text appearing before and after the highlighted text?
Thanks & Regards,
Sreehareesh
Hi Sreee,
Your BreakIterator does this, when it breaks the content into "sentences". PostingsHighlighter then returns the sentence that had the match.
Hi Mike,
Thanks for the quick answer.
But I missed out to mention something in my question :(
I'm using Solr version 1.4. I think BreakIterator comes along with FastVectorHighlighter which is not supported in this version.
In this case are there any ways to achieve the goal?
One more issue I'm facing with the highlighting: I set the fragment length to 500 (hl.fragsize=500), but the returned search results vary in length to a great extent (like 408 or 520). No slop is defined here. Is this expected behavior, or can it be exactly 500?
Thanks & Regards,
Sreehareesh
Hi Sreee,
I'm not sure how BreakIterator is exposed via Solr, and I'm also uncertain how the older highlighters interpret the hl.fragsize; can you send an email to solr-user@lucene.apache.org to ask these questions?
Hi Mike,
I just sent out a mail and am waiting for the reply.
Thanks & Regards,
Sreehareesh
Hi Mike,
Just FYI, here is the link to the response from the Apache Lucene list: http://lucene.472066.n3.nabble.com/Solr-highlighting-fragment-issue-td4088208.html
Thanks & Regards,
Sreehareesh
Hello dears,
My question is not relevant to the topic under discussion, but I hope I will get some help. I tried to use the Utils class by importing "org.apache.poi.hdf.extractor.Utils" as follows:
Utils util = new Utils();
util.displayTokens(new myAnalyzer(), searchWord);
However, the class is not in the Lucene version I am using (Lucene 1.4.3) and I tried to search for a separate jar but could not find it. Please forward your help.
Lucene 1.4.3 is truly ancient.
It sounds like you are using Apache POI, to extract tokens? Maybe send an email to the POI user's list? (user@poi.apache.org)
Actually I tried to use different versions of Lucene (up to 4.0.0) and I tried to search for a separate jar file but could not find one. I got the code from an old program but there is no import to support it.
Utils util = new Utils();
util.displayTokens(new myAnalyzer(), searchWord);
Now, as per your advice, I sent an email to user@poi.apache.org. I really appreciate your kind support.