Comments on Changing Bits: A new Lucene highlighter is born
Blog author: Michael McCandless. Feed updated 2023-09-01T03:38:08.236-04:00.

Actually, I tried different versions of Lucene (up to 4.0.0) and searched for a separate jar file, but could not find one. I got the code from an old program, but it has no import to support it:

    Utils util = new Utils();
    util.displayTokens(new myAnalyzer(), searchWord);

Following your advice, I have sent an email to user@poi.apache.org. I really appreciate your kind support.

Posted by Belay, 2013-12-02T02:14:20.723-05:00

Lucene 1.4.3 is truly ancient.

It sounds like you are using Apache POI to extract tokens? Maybe send an email to the POI user's list (user@poi.apache.org)?

Posted by Michael McCandless, 2013-11-27T05:54:42.298-05:00

Hello dears,
My question is not relevant to the topic under discussion, but I hope I can get some help. I tried to use the Utils class by importing "org.apache.poi.hdf.extractor.Utils" as follows:

    Utils util = new Utils();
    util.displayTokens(new myAnalyzer(), searchWord);

However, the class is not in the Lucene I am using (Lucene 1.4.3), and I searched for a separate jar but could not find it. Please help.

Posted by Belay, 2013-11-27T02:58:22.951-05:00

Hi Mike,

Just FYI, here is the link to the response from Apache Lucene: http://lucene.472066.n3.nabble.com/Solr-highlighting-fragment-issue-td4088208.html

Thanks & Regards,
Sreehareesh

Posted by Sreee, 2013-09-05T00:10:42.190-04:00

Hi Mike,

I just sent out a mail and am waiting for the reply.

Thanks & Regards,
Sreehareesh

Posted by Sreee, 2013-09-04T07:50:11.261-04:00

Hi Sreee,

I'm not sure how BreakIterator is exposed via Solr, and I'm also uncertain how the older highlighters interpret hl.fragsize; can you send an email to solr-user@lucene.apache.org to ask these questions?

Posted by Michael McCandless, 2013-09-04T06:53:34.943-04:00

Hi Mike,

Thanks for the quick answer.

But I forgot to mention something in my question :(

I'm using Solr version 1.4. I think BreakIterator comes along with FastVectorHighlighter, which is not supported in this version.
In this case, is there any way to achieve the goal?

One more issue I'm facing with the highlighting: I set the fragment length to 500 (hl.fragsize=500), but the length of the returned fragments varies considerably (e.g. 408 or 520). No slop is defined here. Is this expected behavior, or can it be exactly 500?

Thanks & Regards,
Sreehareesh

Posted by Sreee, 2013-09-03T23:40:29.571-04:00

Hi Sreee,

Your BreakIterator does this when it breaks the content into "sentences". PostingsHighlighter then returns the sentence that had the match.

Posted by Michael McCandless, 2013-09-03T07:42:24.145-04:00
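
Sentence-level passages of this kind can be reproduced with the JDK alone. The following is a minimal sketch, assuming a standard `java.text.BreakIterator` sentence instance; the class and method names (`SentencePassages`, `split`, `passageAt`) are illustrative and are not part of Lucene or Solr.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentencePassages {
    /** Splits text into passages at the boundaries reported by a sentence BreakIterator. */
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> passages = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            passages.add(text.substring(start, end));
        }
        return passages;
    }

    /** Returns the passage that contains the given character offset (e.g. a match position). */
    static String passageAt(String text, int offset) {
        if (offset < 0 || offset >= text.length()) {
            return "";
        }
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        int end = it.following(offset); // first boundary after the match
        int start = it.previous();      // boundary that opens the same passage
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. The highlighter shows matching passages. Done.";
        System.out.println(split(text));
        System.out.println(passageAt(text, text.indexOf("highlighter")).trim());
    }
}
```

The amount of text shown around a match is therefore decided by where the iterator places its boundaries, not by a separate "context length" setting.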

Hi Mike,

Is there any way to control the length of the text appearing before and after the highlighted text?

Thanks & Regards,
Sreehareesh

Posted by Sreee, 2013-09-03T02:48:04.672-04:00

Hi Anonymous,

Better to ask questions like this on dev@lucene.apache.org...

PostingsHighlighter uses a priority queue to visit all matches for a given document in offset order.

Posted by Michael McCandless, 2013-08-17T07:14:02.770-04:00
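
The general technique behind that answer is a k-way merge: each query term contributes an already sorted list of match offsets, and a priority queue always hands back the term whose next offset is smallest. The sketch below illustrates only that idea with plain JDK classes; `OffsetMerge` and `Cursor` are hypothetical names, and this is not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class OffsetMerge {
    /** Cursor over one term's match start offsets, assumed to be sorted ascending. */
    static final class Cursor {
        final String term;
        final int[] offsets;
        int pos = 0;

        Cursor(String term, int... offsets) {
            this.term = term;
            this.offsets = offsets;
        }

        int current() {
            return offsets[pos];
        }
    }

    /** Visits every match of every term in increasing offset order. */
    static List<String> mergeInOffsetOrder(List<Cursor> cursors) {
        PriorityQueue<Cursor> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));
        for (Cursor c : cursors) {
            if (c.offsets.length > 0) {
                queue.add(c);
            }
        }
        List<String> visited = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();           // term with the smallest next offset
            visited.add(top.current() + ":" + top.term);
            top.pos++;
            if (top.pos < top.offsets.length) {  // re-queue if the term has more matches
                queue.add(top);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        List<Cursor> cursors = Arrays.asList(
                new Cursor("lucene", 0, 40),
                new Cursor("highlighter", 12, 55));
        System.out.println(mergeInOffsetOrder(cursors));
        // prints [0:lucene, 12:highlighter, 40:lucene, 55:highlighter]
    }
}
```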

Hello Mike,
What data structure is used internally for highlighting, especially in the new PostingsHighlighter?

Posted by Anonymous, 2013-08-16T08:55:24.464-04:00

Hi Alwyn,

It sounds like you'll need to make a custom BreakIterator that breaks by newline (if such a thing doesn't already exist somewhere!). But I think you should email the Lucene user's list (java-user@lucene.apache.org) to see if there are other ideas?

Posted by Michael McCandless, 2013-06-27T06:53:41.393-04:00
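
One possible shape for such a newline-based iterator is sketched below. It subclasses the JDK's `java.text.BreakIterator` and places a boundary just after every `'\n'`. `LineBreakIterator` and its `lineAt` helper are hypothetical names introduced here; how the iterator is handed to PostingsHighlighter depends on the Lucene version and is not shown.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

/** Hypothetical BreakIterator whose boundaries fall just after each newline character. */
class LineBreakIterator extends BreakIterator {
    private CharacterIterator text = new StringCharacterIterator("");
    private int current;

    @Override public void setText(CharacterIterator newText) {
        text = newText;
        current = text.getBeginIndex();
    }

    @Override public CharacterIterator getText() { return text; }
    @Override public int current() { return current; }
    @Override public int first() { return current = text.getBeginIndex(); }
    @Override public int last() { return current = text.getEndIndex(); }

    @Override public int next() {
        int end = text.getEndIndex();
        if (current >= end) return DONE;
        int i = current;
        while (i < end && text.setIndex(i) != '\n') i++; // scan to the next newline
        return current = Math.min(i + 1, end);           // boundary sits after it
    }

    @Override public int previous() {
        int begin = text.getBeginIndex();
        if (current <= begin) return DONE;
        int i = current - 1;                              // step over the closing newline
        while (i > begin && text.setIndex(i - 1) != '\n') i--;
        return current = i;
    }

    @Override public int next(int n) {
        int result = current;
        for (int i = 0; i < n && result != DONE; i++) result = next();
        for (int i = 0; i > n && result != DONE; i--) result = previous();
        return result;
    }

    @Override public int following(int offset) {
        checkOffset(offset);
        current = offset;
        return next();
    }

    @Override public int preceding(int offset) {
        checkOffset(offset);
        current = offset;
        return previous();
    }

    private void checkOffset(int offset) {
        if (offset < text.getBeginIndex() || offset > text.getEndIndex()) {
            throw new IllegalArgumentException("offset out of bounds: " + offset);
        }
    }

    /** Returns the whole line (including its trailing newline) that contains offset. */
    static String lineAt(String s, int offset) {
        LineBreakIterator it = new LineBreakIterator();
        it.setText(s);
        int end = it.following(offset);
        if (end == DONE) return "";
        int start = it.previous();
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String sql = "select *\nfrom docs\nwhere id = 1";
        System.out.print(lineAt(sql, sql.indexOf("docs")));
    }
}
```

Returning X lines before or after a match could then be built on the same iterator by calling `previous()` or `next()` additional times before taking the substring.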

Hi Michael,

I'm working on a utility that will use a Lucene index to search files (mostly Java and SQL) instead of me scripting find/grep.

When using PostingsHighlighter, the BreakIterator options don't currently allow me to get the whole line for a term, from start to end of line.

Can you give me some advice on the best place to start implementing something like that? Eventually I'd even like to tell it to give me X lines before and/or after the match.

Posted by Alwyn Schoeman, 2013-06-26T14:40:40.802-04:00

Hi Alexey,

MultiTermQueries are tricky: PostingsHighlighter intentionally does nothing with them because it can be a performance trap.

One simple thing you can do is rewrite the query yourself up-front:

    query = searcher.rewrite(query);

And then search and highlight with that query. The problem is, when an MTQ matches enough terms, it will rewrite to a filter and I believe no terms will be highlighted. You can change this by setting the rewrite method for the query, but ... this gets costly because the more terms the highlighter (and BooleanQuery) must visit, the more CPU/IO is spent.

Honestly, when an MTQ matches many terms, I don't think highlighting is really so useful, so perhaps it's good that the filter won't highlight anything.

Posted by Michael McCandless, 2013-05-21T11:26:38.540-04:00

Hi Michael,
Does this new highlighter support wildcard and fuzzy queries? From my tests, it's not capable of highlighting such matches.

Regards,
Alexey

Posted by Alexey, 2013-05-21T10:12:55.862-04:00

Hi Ronald,

Probably the simplest way to see how to use PostingsHighlighter is to look at its unit tests: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_3/lucene/highlighter/src/test/org/apache/lucene/search/postingshighlight/

Posted by Michael McCandless, 2013-05-20T12:41:04.162-04:00

Hello Mike, I have gone through the code and am a bit skeptical about the implementation. Since Lucene 4 is new and no tutorials are available yet, I am confused about how to get the snippets/term positions from the index. I know it is similar to 3.x, but there is a big change in Lucene 4, and I need to understand how to get the snippets and display the search results using the highlighter...

Regards,
Ronald

Posted by Ronald, 2013-05-20T08:18:21.352-04:00

Anonymous,

It should "work" in that no exception will be generated, but this highlighter makes no guarantee that the snippets it shows you actually "match" the query. The same goes for other positional queries, e.g. PhraseQuery. But in practice, from playing with it, I'm not sure it often matters ... because snippets with many and diverse term matches score higher, and so the ones that are shown are usually good matches.

Try it and see!

Posted by Michael McCandless, 2013-05-07T15:54:47.495-04:00

Mike

Does the new highlighter work with SpanQuery?

Posted by Anonymous, 2013-05-07T13:46:29.209-04:00

Hi Ronald,

Maybe have a look at the unit test? TestPostingsHighlighter.java

Posted by Michael McCandless, 2013-03-25T15:15:09.872-04:00

Hey Mike, is there any sample code for an implementation of the highlighter?

Regards,
Ronald

Posted by Ronald, 2013-03-23T12:05:59.177-04:00

Unfortunately, proper highlighting relies on the Analyzer producing correct offsets ... if the Analyzer is buggy then the highlights will be off, regardless of which highlighter impl you use ...

Posted by Michael McCandless, 2013-03-03T15:24:54.455-05:00

I just discovered https://issues.apache.org/jira/browse/LUCENE-4641, which just rained pretty hard on my parade. If I can't have WordDelimiterFilterFactory I can't move forward ... which kills me, because the current highlighter has been the cause of most of my search woes over the years.

Posted by Anonymous, 2013-03-01T18:24:46.570-05:00

Wow, that UI is nice! How do you tokenize/decompound? (This is challenging in German!)
I had Chrome translate to English and it properly translated the highlighted parts too :)

I wonder if the open-source PDF packages (e.g. Apache PDFBox) could be swapped in?

That's a good idea on the snippet ...

Thanks for sharing your PostingsHighlighter experience, and this is good news!

Posted by Michael McCandless, 2013-02-04T06:44:46.123-05:00

Thank you again Mike, I am going to ask on solr-user. In the meantime I am trying a crutch: just passing the whole snippet to my frontend code and searching for the text snippet myself...

Unfortunately I am probably unable to share much of the work on the PDF system, as it uses a proprietary library for parsing the PDFs (PDFLibTet). The PDF renderer is actually written in PHP (...). You can see an example of the search and display by going to
http://www.jusmeum.de/suche?search%5Bquery%5D=verbraucher+widerrufsrecht+nachricht
and clicking one of the results (these are German legal judgments ;-) That is still using the regexp fragmenter for highlighting, but I am probably going to switch to the PostingsHighlighter, since all my tests so far have been positive.

Posted by Adrian Pemsel, 2013-02-04T05:25:14.143-05:00