Sunday, December 30, 2012

A new Lucene highlighter is born

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it's the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google's Chrome browser has an ingenious solution to the final inch problem, when you use "Find..." to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn't store the full token graph.

Unlike the existing highlighters, which rely on term-vectors or on re-analysis of each matched document to obtain the per-token offsets, PostingsHighlighter uses the recently added postings offsets feature. To index postings offsets you must set the field to be highlighted to use FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS option during indexing.

It turns out postings offsets is much more efficient storage for offsets because the default codec (currently Lucene41) does a good job compressing them: ~1.1 byte per position, which includes both start and end offset. In contrast, term vectors require substantially more disk space (~7.8X for the 10 million document English Wikipedia index), slow down indexing and merging, and are slow to access at search time. A smaller index also means the "working set" size, i.e. the net number of bytes that your search application frequently hits from disk, will be smaller, so you'll need less RAM to keep the index hot.

PostingsHighlighter uses a BreakIterator to find passages in the text; by default it breaks using getSentenceIterator. It then iterates in parallel (merge sorting by offset) through the positions of all terms from the query, coalescing those hits that occur in a single passage into a Passage, and then scores each Passage using a separate PassageScorer.

The scoring model is fun: it treats the single original document as the whole corpus, and then scores individual passages as if they were documents in this corpus. The default PassageScorer uses BM25 scoring, biased with a normalization factor that favors passages occurring closer to the start of the document, but it's pluggable so you can implement your own scoring (and feel free to share if you find an improvement!).

This new highlighter should be substantially faster than our existing highlighters on a cold index (when the index doesn't fit entirely into available RAM), as it does more sequential IO instead of seek-heavy random access. Furthermore, as you increase the number of top hits, the performance gains should be even better. Also, the larger the documents the better the performance gains should be.

One known limitation is that it can only highlight a single field at a time, i.e. you cannot pass it N fields and have it pick the best passages across all of them, though both existing highlighters have the same limitation. The code is very new and may still have some exciting bugs! This is why it's located under Lucene's sandbox module.

If you are serious about highlighting in your search application (and you should be!) then PostingsHighlighter is well worth a look!

34 comments:

  1. Sounds very promising! I am curious though, how does this highlighter know sentences boundaries? I thought Lucene by default looses this info.

    ReplyDelete
  2. Hi Anonymous,

    The new highlighter uses a BreakIterator, so you are free to pick whatever fragments you want. If you don't specify one it defaults to BreakIterator.getSentenceInstance(Locale.ROOT).

    ReplyDelete
  3. Sounds interesting. Can we use the SentenceDetector from OpenNLP project to pick the fragments ?

    ReplyDelete
  4. Hi Juampa,

    Does OpenNLP's SentenceDetector implement BreakIterator? If so, then it's easy (just pass it when you create the PostingsHighlighter).

    If not, you'd have to make your own BreakIterator that wraps SentenceDetector and bridges the two APIs.

    ReplyDelete
  5. I'm super excited to see this come to fruition - thanks to Robert. We'll be giving it a whirl as soon as we can work it into our schedule.

    ReplyDelete
  6. That's great work and I enjoy playing around with it so far. How would you solve the second step of the final inch problem? At the moment I fear I have to use a second index (where each page is a document) but that seams to be overkill - is there no way to get the offset into the text for each highlighting?

    ReplyDelete
  7. Hi Adrian,

    You'd need 1) the reverse mapping from character offsets back to "locations" (ideally exact spot) in the rendered document, and 2) to work with whatever is used to render the document to the user (the browser for HTML, Acrobat reader for PDF, Word for word docs, etc.) to "jump to that spot and highlight it".

    For 1) you'd need document filters that preserve "location in the original rendered document" somehow, and then you could eg populate a custom attribute during analysis, and re-analyzing the document (this isn't so bad because you're analyzing just the 1 doc that the user just clicked on), and then pull tokens until you're at the right offset (that was highlighted).

    Eg, if your "location" is just page number, then you could have a PageNumberAttribute that increments whenever you cross to another page, and eg for PDF it's possible to jump to a particular page (http://acrobatusers.com/tutorials/can-i-hyperlink-specific-page-pdf-file ) for the second part.

    But it's definitely tricky!

    ReplyDelete
  8. Thank you so much. Just getting the character offset in the Solr response would be enough for me since I can get the character offset -> location mapping from my existing system for free. We are (mainly) indexing PDFs and the system I developed translates that to the text that is sent to Solr and a HTML version for display - essentially a PDF renderer outputing HTML tags... THAT was tricky ;-)

    ReplyDelete
  9. Hi Adrian,

    Maybe you could share your cool system for mapping from char offset -> location :)

    I'm not sure how to get the offsets from Solr, but hopefully this is possible! The underlying Lucene index clearly has it ...

    Maybe email solr-user@lucene.apache.org?

    ReplyDelete
  10. Thank you again Mike, I am going to ask on solr-user. In the meantime I am trying a crutch by just passing the whole snippet to my frontend code and searching for the text snippet myself...

    Unfortunately I am probably unable to share much of the work on the PDF system as it is using a proprietary library for parsing the PDFs (PDFLibTet). The PDF renderer is actually written in PHP (...), you can see an example of the search and display by going to
    http://www.jusmeum.de/suche?search%5Bquery%5D=verbraucher+widerrufsrecht+nachricht
    and clicking one of the results (these are German legal judgments ;-) That is still using the regexp fragmenter for highlighting, but I am probably going to switch to the PostingsHighlighter since all my tests so far have been positive.

    ReplyDelete
  11. Wow, that UI is nice! How do you tokenize/decompound? (This is challenging in German!). I had Chrome translate to English and it properly translated the highlighted parts too :)

    I wonder if the open-source PDF packages (eg Apache PDFBox) could be swapped in?

    That's a good idea on the snippet ...

    Thanks for sharing your PostingsHighlighter experience, and this is good news!

    ReplyDelete
  12. I just discovered https://issues.apache.org/jira/browse/LUCENE-4641 which just rained pretty hard on my parade. If I can't have WordDelimiterFilterFactory I can't move forward ... which kills me because the current highlighter has been the cause of most of my search woes over the years.

    ReplyDelete
  13. Unfortunately proper highlighting relies on the Analyzer producing correct offsets ... if the Analyzer is buggy then the highlights will be off, regardless of which highlighter impl you use ...

    ReplyDelete
  14. Hey Mike, any sample code for implementation of Highlighter ?

    Regards,
    Ronald

    ReplyDelete
  15. Hi Ronald,

    Maybe have a look at the unit test? TestPostingsHighlighter.java

    ReplyDelete
  16. Mike

    Does the new highlighter work with SpanQuery?

    ReplyDelete
  17. Anonymous,

    It should "work" in that no exception will be generated, but, this highlighter makes no guarantee that the snippets it shows you actually "match" the query. Same with other positional queries e.g. PhraseQuery. But in practice playing with it I'm not sure it often matters ... because snippets with many and diverse term matches score higher and so the ones that are shown are usually good matches.

    Try it and see!

    ReplyDelete
  18. Hello Mike, i have gone through the code, am bit skeptic about the implementation, since L4, is new and now tutorials available yet, confused how to get the snippets /term positions from the index, i knew that is similar to 3.x, in L4 there is a big change, need to understand how to get the snippets and display the search results using highlighter...

    Regards,
    Ronald

    ReplyDelete
  19. Hi Ronald,

    Probably the simplest way to see how to use PostingsHighlighter is to look at its unit tests: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/lucene/highlighter/src/test/org/apache/lucene/search/postingshighlight/

    ReplyDelete
  20. Hi Michael,
    Does this new highlighter support wildcard and fuzzy queries? From my tests, it's not capable of highlighting such matches.

    Regards,
    Alexey

    ReplyDelete
  21. Hi Alexey,

    MultiTermQueries are tricky: PostingsHighlighter intentionally does nothing with them because it can be a performance trap.

    One simple thing you can do is rewrite the query yourself up-front:

    query = searcher.rewrite(query);

    And then search and highlight with that query. The problem is, when a MTQ matches enough terms, it will rewrite to a filter and I believe no terms will be highlighted. You can change this by setting the rewrite method for the query, but ... this gets costly because the more term the highlighter (and BooleanQuery) must visit, the more CPU/IO is spent.

    Honestly, when an MTQ matches many terms, I don't think highlighting is really so useful, so perhaps it's good that the filter won't highlight anything.

    ReplyDelete
  22. Hi Michael,

    I'm working on a utility that will use a Lucene index to search files (mostly java and sql) instead of me scripting find/grep.

    When using PostingsHighlighter the BreakIterator options doesn't currently allow me to get a whole line for a term, start till end of line.

    Can you give me some advice on the best place to start implementing something like that? Eventually I'd even like to tell it to give me X lines before and/or after the match.

    ReplyDelete
    Replies
    1. Hi Alwyn,

      I sounds like you'll need to make a custom BreakIterator, that breaks by newline (if such a thing doesn't already exist somewhere!). But I think you should email the Lucene user's list (java-user@lucene.apache.org) to see if there are other ideas?

      Delete
  23. Hello Mike,
    What data structure is internally used for this Highlighting purpose, specially the new one PostingsHighlighter?

    ReplyDelete
    Replies
    1. Hi Anonymous,

      Better to ask questions like this on dev@lucene.apache.org...

      PostingsHighlighter uses a priority queue to visit all matches for a given document in offsets order.

      Delete
  24. Hi Mike,

    Are there any ways out to control the length of the text appearing before and after the highlighted text?

    Thanks & Regards,
    Sreehareesh

    ReplyDelete
    Replies
    1. Hi Sreee,

      Your BreakIterator does this, when it breaks the content into "sentences". PostingsHighlighter then returns the sentence that had the match.

      Delete
    2. Hi Mike,

      Thanks for the quick answer.

      But I missed out to mention something in my question :(

      I'm using Solr version 1.4. I think BreakIterator comes along with FastVectorHighlighter which is not supported in this version.
      In this case are there any ways to achieve the goal?

      One more issue I'm facing with the highlighting. I set the fragment length to 500(hl.fragsize=500). But the returned search result varies in length in greater extends (like 408 or 520). No slop is defined here. Is this an expected behavior or can it be exactly 500?

      Thanks & Regards,
      Sreehareesh

      Delete
    3. Hi Sreee,

      I'm not sure how BreakIterator is exposed via Solr, and I'm also uncertain how the older highlighters interpret the hl.fragsize; can you send an email to solr-user@lucene.apache.org to ask these questions?

      Delete
    4. Hi Mike,

      I just sent out a mail and waiting for the reply.

      Thanks & Regards,
      Sreehareesh

      Delete
    5. Hi Mike,

      Just FYI, here is the link to the response from Apache Lucene http://lucene.472066.n3.nabble.com/Solr-highlighting-fragment-issue-td4088208.html

      Thanks & Regards,
      Sreehareesh

      Delete
  25. Hello dears,
    My question is not relevant to the topic under discussion but I hope I will get some help. I tried to use the Utils class by importing "org.apache.poi.hdf.extractor.Utils" as follows:
    Utils util = new Utils();
    util.displayTokens(new myAnalyzer(), searchWord);
    However, the class is not in the lucene I am using (lucene 1.4.3) and I tried to search for a separate jar but could not get it. Pls forward your help.

    ReplyDelete
    Replies
    1. Lucene 1.4.3 is truly ancient.

      It sounds like you are using Apache POI, to extract tokens? Maybe send an email to the POI user's list? (user@poi.apache.org)

      Delete
    2. Actually I tried to use different versions of Lucene (up to 4.0.0) and I tried to search for a separate jar file but could not get one. I got the code from an old program but there is no any import to support it.
      Utils util = new Utils();
      util.displayTokens(new myAnalyzer(), searchWord);

      Now, as per to your advice I sent email to (user@poi.apache.org). I really appreciate for you kind support.

      Delete