Comments on Changing Bits: "New index statistics in Lucene 4.0"

Michael McCandless (2017-12-13):
Can you load your norms at search time, sum them up, and compute the mean? You'd just have to do it on each searcher init.

Anonymous (2017-12-10):
Hello, I calculate the average term frequency for each document in Similarity.computeNorm and encode it into my norm. However, I also need the corresponding mean value over all documents. Does anybody have a suggestion for how I could achieve this?
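Mike's suggestion could be sketched roughly like this, assuming a Lucene 4.x (4.1+) reader where per-segment norms are exposed as NumericDocValues; the class name MeanNorm is illustrative, and the values summed here are the *encoded* norms, so decode them with your Similarity first if you need the true mean:

```java
import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.NumericDocValues;

// Sketch: on each searcher init, walk every segment's norms for a field,
// sum them, and compute the mean over all documents.
public final class MeanNorm {
  public static double meanEncodedNorm(IndexReader reader, String field) throws IOException {
    long sum = 0;
    long count = 0;
    for (AtomicReaderContext ctx : reader.leaves()) {
      NumericDocValues norms = ctx.reader().getNormValues(field);
      if (norms == null) {
        continue;  // this segment has no norms for the field
      }
      int maxDoc = ctx.reader().maxDoc();
      for (int docID = 0; docID < maxDoc; docID++) {
        sum += norms.get(docID);  // raw encoded norm byte/long
        count++;
      }
    }
    return count == 0 ? 0.0 : (double) sum / count;
  }
}
```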
Michael McCandless (2014-11-02):
Hi Arnaldo,
Can you ask this question on the Lucene user's list (java-user@lucene.apache.org)?
Anonymous (2014-10-23):
Hi Mike,
I'd like to know if there is any Analyzer that (at indexing time) doesn't tokenize on white space but lets me manage multi-word phrases as single terms, so that the API described in this post keeps working as expected. In my use case I have a set of phrases for which I need TermsEnum.docFreq(). Is something like that possible? If there isn't, would it be trivial to implement a custom analyzer? Many thanks.
Michael McCandless (2013-08-21):
Hi Bintang,
Can you send questions to the users' list (java-user@lucene.apache.org)? Include the code fragment you're currently using and the exception showing what went wrong ...
Anonymous (2013-08-20):
Hey Mike,
I'm trying to understand what you explained, but I'm pretty confused. I try to create the TermsEnum from the IndexReader, but it always gives me a null exception.
Michael McCandless (2013-01-29):
Anonymous,
That stat is available at indexing time: it is passed to your Similarity.computeNorm in the FieldInvertState argument, as the member uniqueTermCount. If you need this at search time you'll have to save it away somewhere ...
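A minimal sketch of capturing that stat at indexing time, written against the Lucene 4.0-style computeNorm(FieldInvertState, Norm) API these comments refer to. The class name, the map, and the assumption that DefaultSimilarity's computeNorm is overridable in your release are all illustrative; a real implementation would need to key the counts by document:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.Norm;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Sketch: grab uniqueTermCount from FieldInvertState during indexing and
// "save it away somewhere" for use at search time.
public class UniqueTermCountSimilarity extends DefaultSimilarity {
  // Illustrative only: field name -> unique term count of the document
  // currently being inverted. A real impl would key this by (doc, field).
  public final Map<String, Integer> lastUniqueTermCount = new ConcurrentHashMap<>();

  @Override
  public void computeNorm(FieldInvertState state, Norm norm) {
    lastUniqueTermCount.put(state.getName(), state.getUniqueTermCount());
    super.computeNorm(state, norm);  // keep the default length norm
  }
}
```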
Anonymous (2013-01-29):
Hi Mike,
How could I get the number of unique terms in a document in my Similarity impl? Thanks in advance.
Michael McCandless (2012-12-25):
Hi hossein,
You need to first get a TermsEnum (Terms.iterator(null)), seek to that term (termsEnum.seekExact: verify you got back "true", indicating that the term exists), pull a DocsEnum from it, use its .advance method to skip to that docID (verify the returned docID is the one you advanced to), then call freq().
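The steps above, sketched against the Lucene 4.x API (exact signatures such as seekExact's arguments varied slightly across 4.x releases; the helper name termFreqInDoc is illustrative):

```java
import java.io.IOException;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public final class TermFreq {
  // Returns the within-document frequency of term in field, or 0 if the
  // field, the term, or the posting for docID is absent.
  public static int termFreqInDoc(AtomicReader reader, String field,
                                  String term, int docID) throws IOException {
    Terms terms = reader.terms(field);
    if (terms == null) {
      return 0;                                  // field was not indexed
    }
    TermsEnum termsEnum = terms.iterator(null);  // step 1: get a TermsEnum
    if (!termsEnum.seekExact(new BytesRef(term))) {
      return 0;                                  // step 2: term does not exist
    }
    DocsEnum docs = termsEnum.docs(null, null);  // step 3: postings for the term
    if (docs.advance(docID) != docID) {
      return 0;                                  // step 4: term absent from doc
    }
    return docs.freq();                          // step 5: within-doc frequency
  }
}
```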
hossein tahani (2012-12-23):
Hi Mike,
How can I get the freq of a term in a document? Thanks in advance.
Michael McCandless (2012-08-28):
Hi David,
I think it's actually an artifact, from back when TFIDFSim was the only Similarity in Lucene, that its decodeNormValue is still public ... really, how the sim stores its stats (and what stats it stores) are private implementation details.

If you really want the int per field X doc ... you could clone BM25Sim, but then instead of norm.setByte(...) in computeNorm, use norm.setInt (and then remove all the encode/decodeNormValue stuff). Then in the Exact/Sloppy scorers, pull the int[] docLengths instead of the byte[] norms. The downside of course is 4X the RAM per field X doc.
Anonymous (2012-08-25):
Hi Mike,
Thanks for the quick response.

I'm experimenting with BM25 and the new language-model-based similarities in Lucene 4. It turns out that the method decodeNormValue() is public for TFIDFSimilarity but protected for the rest. Of course I can extend these similarity classes to expose the method, so my question is whether there is a special reason why it should not be public by default.

In general, I really like the new flexibility of the similarity measures provided by Lucene 4. I would, however, wish for a simpler and more intuitive API to extract the (exact) doc length from the index. I think these values should be provided by the API and not handled by the application (just wishful thinking).

Thanks, and all the best,
David
Michael McCandless (2012-08-24):
Hi David,
Which similarity are you using? Just the default (DefaultSimilarity)? If so, you can get the docLength from the norm by calling DefaultSimilarity.decodeNormValue. Note that if you boosted the field during indexing, that will change this document length (a higher boost makes the doc length smaller). Also note that this is heavily quantized: by default we use only a single byte to encode the length.

If the quantization or the mix-in of the boost is a problem, you can also make your own Similarity impl and store docLen yourself (e.g. as an int).
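The single-byte quantization can be illustrated with a self-contained re-implementation of the kind of small-float scheme Lucene uses for norms (3 mantissa bits, 5 exponent bits, zero-exponent point at 15); this is a sketch modeled on Lucene's SmallFloat byte315 variant, not the shipped code:

```java
public final class SmallFloatDemo {
  // Encode a float into one byte: keep the top 3 mantissa bits and 5
  // exponent bits, re-centering the exponent so that the byte's zero
  // point sits at exponent 15.
  public static byte floatToByte315(float f) {
    int bits = Float.floatToRawIntBits(f);
    int smallfloat = bits >> (24 - 3);
    if (smallfloat <= ((63 - 15) << 3)) {
      return (bits <= 0) ? (byte) 0 : (byte) 1;  // underflow: 0 or smallest positive
    }
    if (smallfloat >= ((63 - 15) << 3) + 0x100) {
      return -1;                                 // overflow: largest representable
    }
    return (byte) (smallfloat - ((63 - 15) << 3));
  }

  // Decode the byte back to a float; most of the original precision is gone.
  public static float byte315ToFloat(byte b) {
    if (b == 0) {
      return 0.0f;
    }
    int bits = (b & 0xff) << (24 - 3);
    bits += (63 - 15) << 24;
    return Float.intBitsToFloat(bits);
  }

  public static void main(String[] args) {
    // 1.0 survives the round trip exactly, but 0.9 is quantized to 0.875:
    System.out.println(byte315ToFloat(floatToByte315(1.0f)));  // prints 1.0
    System.out.println(byte315ToFloat(floatToByte315(0.9f)));  // prints 0.875
  }
}
```

With only 256 representable values, nearby document lengths collapse to the same norm byte, which is exactly the quantization Mike describes.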
Anonymous (2012-08-24):
Hi,
I'm trying to figure out, in Lucene 4, the document length (the total number of keywords it contains) from the norm vector at query run-time. For some (unclear) reason the encode method of the norm values is not public. Is there any way of doing it?
Michael McCandless (2012-08-23):
Hi Anonymous,
You should pull a DocsEnum for that field + term, advance to the docID, and call .freq() to get that count. This is also available in term vectors (if you want to get the count for all terms within a single doc).

Anonymous (2012-08-23):
How can I get the count of a term for a given document and field?

John Wang (2012-03-22):
This is super-awesome! Thanks Mike!