Comments on Changing Bits: Building a new Lucene postings format

Hi Igor, I suggest starting with Lucene's "...

2017-11-21T10:02:36.194-05:00

Hi Igor, I suggest starting with Lucene's "demo" module, specifically IndexFiles.java and SearchFiles.java. And then send questions to Lucene's users list (java-user@lucene.apache.org).

Hello Mike, I'm starting to study Lucene, and ...

2017-11-18T02:01:03.419-05:00

Hello Mike,
I'm starting to study Lucene, and I'm curious to see what the inverted index looks like after indexing some files.
I would like to know what steps I should take and which classes to use.
Thanks if you can respond.

Hi Mike, Thanks a lot for your reply. I could solv...

2014-07-07T06:40:56.929-04:00

Hi Mike,
Thanks a lot for your reply. I could solve some of the easier issues listed above. I will post a new question covering the remaining issues.

Hi Aditya, Sorry for the slow response here; if y...

2014-07-06T06:31:04.658-04:00

Hi Aditya,

Sorry for the slow response here; if you haven't already, could you ask this question on the Lucene users list instead? (java-user@lucene.apache.org).

Hi Mike, We were trying to use a configurable PerF...

2014-06-16T06:06:48.150-04:00

Hi Mike,
We were trying to use a configurable PerFieldPostingsFormat to get an updateable field backed by a key-value store, similar to what you linked in your post (Flax). However, I doubt that an updateable field is possible using codecs, for the following reasons:

Assume the key for this key-value store is segmentName_fieldName_term and the value is the postings array, say just an array of docIds.
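The key layout described above could be sketched roughly as follows. This is only an illustration in plain Java: the class PostingsKeyValueSketch, its methods, and the in-memory HashMap standing in for the key-value store are all hypothetical, and none of it is Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the external key-value store (not a Lucene class).
final class PostingsKeyValueSketch {
    private final Map<String, int[]> store = new HashMap<>();

    // Builds the key in the segmentName_fieldName_term layout described above.
    static String key(String segmentName, String fieldName, String term) {
        return segmentName + "_" + fieldName + "_" + term;
    }

    // Stores the docIds for one term of one field in one segment.
    void put(String segmentName, String fieldName, String term, int[] docIds) {
        store.put(key(segmentName, fieldName, term), docIds.clone());
    }

    // Returns the stored docIds, or null if nothing was written under that key.
    int[] get(String segmentName, String fieldName, String term) {
        int[] docIds = store.get(key(segmentName, fieldName, term));
        return docIds == null ? null : docIds.clone();
    }

    public static void main(String[] args) {
        PostingsKeyValueSketch kv = new PostingsKeyValueSketch();
        kv.put("seg0", "body", "lucene", new int[] {0, 4, 9});
        System.out.println(kv.get("seg0", "body", "lucene").length);
        System.out.println(kv.get("seg1", "body", "lucene"));
    }
}
```

Every read and write needs the segment name as part of the key, so nothing can be stored or updated for a document until its segment is known.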

Assume there is a field PK (an id-like field) which is stored and uses the default codec.

#1. The field consumers of a PostingsFormat are called only at flush time. With this design, handling an update for a document that has not yet been flushed seems difficult for a field that writes to a key-value store.
So if I add a document with, say, PK:1 and then add a partial update document for PK:1, then to apply the update we first need to figure out the segmentName for this document; only then can we update it in the key-value store. But this document has no segmentName yet. Because of DWPTs, we can't even assume that the document will end up in the same segment.

Note: Since the custom PostingsFormat is only called at flush time, we write the postings data directly to the key-value store.

#2. Merged segments are only checkpointed, not committed. So for the custom TermsConsumer.merge() call we cannot write the new merged state to the key-value store directly. Note that merge() and flush() both go through the same methods (startTerm(), startDoc(), finishTerm(), etc.), and since our design writes directly to the key-value store in those methods, we do not use the TermsConsumer.merge() function. Instead we keep the new merged state in memory, but we have no good signal for when to flush it to the key-value store, so for now we flush it at the next flush. However, Lucene can commit the checkpointed merged segment without any document being flushed. We then end up in a state where the Lucene index directory has the new merged segments but the key-value store has not yet been updated with them. If we instead write the merged state directly, we end up in the opposite state: the key-value store has the new merged segment but the Lucene directory does not.

#3. Dropped segments. Again, we can't handle this, because the only time our custom consumers can remove deleted docs is at merge time, but Lucene can drop segments during normal commits.

#4. Replication. We copy both the index files and the key-value store files for replication, and we cannot keep the two in sync.

#5. Applying updates to a reindexed document when the reindexed document is taken by a DWPT that is flushed after the partial update document. Updates are lost in this case.

Currently, I plan to try adding a SegmentInfosFormat (maybe a LiveDocsFormat too) and a DirectoryWrapper, but as of now I have no idea whether that will solve all the problems.

It would be great to know your thoughts on the feasibility of providing updateable fields using codecs.

I will convert and run all span queries as conjunc...

2014-04-03T15:53:57.520-04:00

I will convert and run all span queries as conjunction queries and then use the output as a filter query for the original span query. That might reduce the scope of the span query substantially.

Thanks for the idea.

Span queries are unfortunately very costly. Have ...

2014-04-02T15:12:44.072-04:00

Span queries are unfortunately very costly. Have you tried a sloppy PhraseQuery instead? It will still be costly, but perhaps less so.

However, even better: do a simpler (non-positional) query and then use the new rescoring APIs to re-sort the top e.g. 500 hits with your positional queries.
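The two-pass idea can be sketched in plain Java as below. This is a toy under stated assumptions, not Lucene's rescoring API: documents are plain token arrays, the class TwoPassSearch and its methods are hypothetical, document order stands in for the first-pass ranking, and "positional query" is reduced to the minimum distance between two terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical toy: cheap non-positional pass, then positional re-sort of the top hits.
final class TwoPassSearch {
    // First pass: documents containing both terms, with no position check.
    static List<Integer> conjunction(String[][] docs, String a, String b) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < docs.length; doc++) {
            List<String> tokens = Arrays.asList(docs[doc]);
            if (tokens.contains(a) && tokens.contains(b)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Smallest distance between an occurrence of a and an occurrence of b,
    // or Integer.MAX_VALUE if the two terms never both occur.
    static int minDistance(String[] tokens, String a, String b) {
        int lastA = -1;
        int lastB = -1;
        int best = Integer.MAX_VALUE;
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(a)) lastA = i;
            if (tokens[i].equals(b)) lastB = i;
            if (lastA >= 0 && lastB >= 0 && lastA != lastB) {
                best = Math.min(best, Math.abs(lastA - lastB));
            }
        }
        return best;
    }

    // Second pass: only the first topN candidates pay for the positional work.
    static List<Integer> search(String[][] docs, String a, String b, int topN) {
        List<Integer> candidates = conjunction(docs, a, b);
        List<Integer> top =
            new ArrayList<>(candidates.subList(0, Math.min(topN, candidates.size())));
        top.sort(Comparator.comparingInt(doc -> minDistance(docs[doc], a, b)));
        return top;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"cat", "and", "dog"},
            {"cat", "sat", "far", "away", "from", "the", "dog"},
            {"bird", "only"},
            {"dog", "cat"},
        };
        System.out.println(search(docs, "cat", "dog", 10));
    }
}
```

The expensive positional check runs on at most topN documents, regardless of how many documents the simple query matched.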

DirectPostingsFormat is a very RAM-heavy, but fast, postings format.

You don't need termOffsets for span queries; that's typically used for highlighting with PostingsHighlighter.

Thanks, so is there a better postingsformat to cho...

2014-04-02T13:29:48.631-04:00

Thanks, so is there a better postings format to choose when using span queries in Solr 4.6 or Solr 4.7? I ask because span queries use all parts of the postings: terms, documents, frequencies, positions and offsets.

When I debug the Lucene 4.6 test cases for span queries, they show that the nextDoc() call above uses DirectPostingsFormat.

Does having termPositions=true and termOffsets=true help?

Yes, Lucene41PF is still the default, so it's ...

2014-04-02T13:15:53.420-04:00

Yes, Lucene41PF is still the default, so it's used for all queries to iterate docs/positions. MemoryPF isn't really a good match: it's best for primary key fields (each term matches just one document). Term vectors are unrelated: they are not used during searching, and they are quite slow to retrieve per document.

Does lucene 4.7 use Lucene41PostingsFormat for Pos...

2014-04-01T17:37:16.869-04:00

Does Lucene 4.7 use Lucene41PostingsFormat for the postings nextDoc() calls while executing span queries? My requirement is to run multiple span queries like [cat dog]~2 on a 2 TB index, and I am worried about performance because I have to collect all the docs in the results. Is there a better postings format to choose, like the memory postings format, when using span queries in Solr 4.7? Does having term vectors help me in this regard?

I don't think there's any documentation fo...

2013-08-21T07:34:37.681-04:00

I don't think there's any documentation for this besides the source code itself; could you ask on the dev list (dev@lucene.apache.org)?

I read http://lucene.apache.org/core/3_0_3/filefor...

2013-08-20T16:26:25.969-04:00

I read http://lucene.apache.org/core/3_0_3/fileformats.html but I am not sure how things work when a query arrives.

Is there documentation about how lucene works inte...

2013-08-20T16:22:33.241-04:00

Is there documentation about how Lucene works internally when one runs a query, say "title:Foo"? For example, how are the tii, tis, prx, fdt, etc. files looked up, and in what order, to get the information?

Thanks Gora, the online IR book is a wonderful resou...

2013-05-28T07:47:25.935-04:00

Thanks Gora, the online IR book is a wonderful resource!

Some advice to all of those asking about what post...

2013-05-28T07:29:49.926-04:00

Some advice to all of those asking what postings are: have a look at the online IR Book at http://www-nlp.stanford.edu/IR-book/ . The first few chapters will help you understand the theory behind postings and the inverted index. After reading the IR Book, this blog post makes much more sense! :-) Thanks Mike.

Hi John, That's a good question ... I'm n...

2013-05-10T15:42:13.812-04:00

Hi John,

That's a good question ... I'm not sure? If any of the codec APIs changed (which is likely: they are marked experimental) then you'd need to upgrade lucene core as well. Try it and report back ;)

Thanks Mike. A lot clearer to me now :) I see that Lu...

2013-05-10T09:22:02.158-04:00

Thanks Mike. A lot clearer to me now :)

I see that Lucene 4 uses the Lucene40 postings format by default. Would it be possible to use BlockPostingsFormat and still be on Lucene 4.0, or do we need to upgrade to Lucene 4.1?

-John

Hi Anonymous, A Codec holds many formats: posting...

2013-05-09T06:38:28.760-04:00

Hi Anonymous,

A Codec holds many formats: postings, term vectors, deletions, etc. PostingsFormat just covers how postings (= the inverted part of the index, i.e. fields, terms, docs, positions, freqs, offsets) are written.

There are very different performance characteristics for each PostingsFormat ... I would say the best one is the current default one (BlockPostingsFormat): it's fast for frequent terms because it bulk encodes/decodes the integers.
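To illustrate why bulk integer coding helps for frequent terms, here is a minimal plain-Java sketch of delta encoding followed by fixed-width bit packing of a block of doc IDs. It is not Lucene's actual on-disk encoding: the class BlockDeltaCodec, its Block type, and the layout are hypothetical and only show the general idea that small gaps between doc IDs pack into few bits and decode in a tight loop.

```java
import java.util.Arrays;

// Hypothetical toy codec: delta-encode sorted doc IDs, then pack every delta
// with the same bit width into a long[] (not Lucene's real format).
final class BlockDeltaCodec {
    static final class Block {
        final int count;        // number of doc IDs in the block
        final int bitsPerValue; // bit width shared by every delta
        final long[] packed;    // deltas packed back to back

        Block(int count, int bitsPerValue, long[] packed) {
            this.count = count;
            this.bitsPerValue = bitsPerValue;
            this.packed = packed;
        }
    }

    // Expects non-negative doc IDs in ascending order.
    static Block encode(int[] sortedDocIds) {
        int n = sortedDocIds.length;
        int[] deltas = new int[n];
        int prev = 0;
        int max = 0;
        for (int i = 0; i < n; i++) {
            deltas[i] = sortedDocIds[i] - prev; // gaps stay small for frequent terms
            prev = sortedDocIds[i];
            max = Math.max(max, deltas[i]);
        }
        // Width needed for the largest delta, at least one bit.
        int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
        long[] packed = new long[(int) (((long) n * bits + 63) / 64)];
        for (int i = 0; i < n; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[word] |= ((long) deltas[i]) << shift;
            if (shift + bits > 64) {
                // Value straddles two words: spill the high bits into the next word.
                packed[word + 1] |= ((long) deltas[i]) >>> (64 - shift);
            }
        }
        return new Block(n, bits, packed);
    }

    static int[] decode(Block block) {
        int[] docIds = new int[block.count];
        long mask = (1L << block.bitsPerValue) - 1;
        int prev = 0;
        for (int i = 0; i < block.count; i++) {
            long bitPos = (long) i * block.bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long value = block.packed[word] >>> shift;
            if (shift + block.bitsPerValue > 64) {
                value |= block.packed[word + 1] << (64 - shift);
            }
            prev += (int) (value & mask); // prefix sum turns gaps back into doc IDs
            docIds[i] = prev;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] ids = {3, 7, 8, 150, 1000, 1001};
        Block block = encode(ids);
        System.out.println(block.bitsPerValue + " bits per value");
        System.out.println(Arrays.toString(decode(block)));
    }
}
```

Because every value in a block has the same width, the decoder has no per-value branching on length, which is what makes this style of encoding attractive for long postings lists.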

I don't think there's any good document that describes the steps in building a PostingsFormat. I would start by looking at the PostingsFormat impls in the Lucene sources. Typically one builds a PostingsBaseFormat, which plugs into the terms dictionary, because building a new terms dictionary is challenging ... the PostingsBaseFormat only needs to encode the docs/freqs/positions/offsets.

Mike Lucene newbie here. Sorry if my question sou...

2013-05-09T05:08:11.621-04:00

Mike

Lucene newbie here. Sorry if my question sounds silly.
I'm still trying to understand the terminology related to Postings/PostingsFormat/Codec. Are they all the same?

From what you said, if a custom PostingsFormat controls how the data is stored in the index, would it be right to say that for the same dataset being loaded into the index, different postings formats have different performance?

What would be the best PostingsFormat to use if speed is important?

Also, can you point me to documentation or an example for building a custom PostingsFormat in Java?

Thanks
John

Thanks Mike Sir for answering.

2013-01-03T07:03:52.345-05:00

Thanks Mike Sir for answering.

The PostingsFormat controls how the inverted index...

2012-10-13T11:21:12.528-04:00

The PostingsFormat controls how the inverted index (terms mapping to list of documents that contain that term, plus things like frequency, positions, offsets) is stored in the index.
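As a rough illustration of what "terms mapping to a list of documents, plus frequency and positions" means, here is a toy in-memory inverted index in plain Java. The class ToyInvertedIndex and its whitespace tokenization are hypothetical; a real PostingsFormat decides how this kind of structure is encoded on disk, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy inverted index (not a Lucene class).
final class ToyInvertedIndex {
    // One posting: a document ID plus the positions where the term occurs in it.
    static final class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();

        Posting(int docId) {
            this.docId = docId;
        }

        // Term frequency within this document.
        int freq() {
            return positions.size();
        }
    }

    // Sorted term -> postings; doc IDs ascend because documents are added in order.
    private final Map<String, List<Posting>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Splits the text on whitespace and records each token's document and position.
    int addDocument(String text) {
        int docId = nextDocId++;
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) {
                continue;
            }
            List<Posting> list = postings.computeIfAbsent(tokens[pos], t -> new ArrayList<>());
            if (list.isEmpty() || list.get(list.size() - 1).docId != docId) {
                list.add(new Posting(docId));
            }
            list.get(list.size() - 1).positions.add(pos);
        }
        return docId;
    }

    // Returns the postings for a term, or an empty list if the term is unknown.
    List<Posting> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("the cat sat");
        index.addDocument("the dog chased the cat");
        for (Posting p : index.lookup("the")) {
            System.out.println("doc=" + p.docId + " freq=" + p.freq() + " positions=" + p.positions);
        }
    }
}
```

A query such as a single-term lookup first finds the term in the sorted map (the terms dictionary in Lucene's vocabulary) and then walks that term's postings list.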

What is a postings format? Is it something related t...

2012-10-13T10:48:39.218-04:00

What is a postings format?
Is it something related to the way a string is sent to the searcher?