Comments on Changing Bits: "Transactional Lucene" (Michael McCandless)

Anonymous (2015-08-25 01:48):
I have a situation where my data is fully indexed. It is possible to make changes to this data, and the delta changes get indexed too, creating new segment files. But I need to prevent the segments holding the delta changes from being merged with the original (big) index segments.

I was thinking about raising the merge factor to a high value so that automatic merging is effectively disabled (and then deciding on a custom merge point), but I read that this risks running out of file descriptors.

How can I segregate the delta index changes efficiently? Also, it would be better if I could merge just the segments created by the delta changes, without touching the original index. Is there any option to do this?

Alebri78 (2015-08-10 10:31):
12:44:24,404 ERROR [stderr] (Lucene Merge Thread #0) Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: Error loading metadata for index file: _2t.cfs|M|Event
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:549)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:522)
    Caused by: java.io.FileNotFoundException: Error loading metadata for index file: _2t.cfs|M|Event
        at org.infinispan.lucene.impl.DirectoryImplementor.openInput(DirectoryImplementor.java:134)
        at org.infinispan.lucene.impl.DirectoryLuceneV4.openInput(DirectoryLuceneV4.java:101)
        at org.apache.lucene.store.CompoundFileDirectory.<init>(CompoundFileDirectory.java:104)
        at org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:274)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:107)
        at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
        at org.apache.lucene.index.ReadersAndUpdates.getReaderForMerge(ReadersAndUpdates.java:664)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4152)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3811)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:409)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:486)

Alebri78 (2015-08-10 10:13):
I use Infinispan 7 and Lucene 4.10.4.
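The merge-segregation question above goes unanswered in this thread. One alternative to inflating the merge factor, sketched here against the Lucene 4.10 API, is to disable automatic merging entirely with NoMergePolicy and merge only when you choose; the "id" field is illustrative, and merging just the delta segments would additionally need a custom MergePolicy.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NoAutoMerge {

    // Index three docs with one commit each; with NoMergePolicy every
    // commit's segment survives untouched. Returns the segment count.
    static int segmentCount() throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_4_10_4,
                new StandardAnalyzer(Version.LUCENE_4_10_4));
        cfg.setMergePolicy(NoMergePolicy.COMPOUND_FILES);  // no automatic merges
        IndexWriter w = new IndexWriter(dir, cfg);
        for (int i = 0; i < 3; i++) {
            Document d = new Document();
            d.add(new StringField("id", "doc" + i, Field.Store.YES));
            w.addDocument(d);
            w.commit();  // each commit leaves its own small segment
        }
        // Merge only when *you* decide, e.g. w.forceMerge(1);
        w.close();
        DirectoryReader r = DirectoryReader.open(dir);
        try {
            return r.leaves().size();  // one leaf reader per segment
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(segmentCount());
    }
}
```

Note that forceMerge would still merge everything together, so this only defers the problem; keeping the big original index and the deltas permanently separate is more naturally done with two directories searched together via a MultiReader.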
Alebri78 (2015-08-10 10:08):
Hi Michael, I have this situation: every minute a job (an EJB timer) adds from 0 to at most 1000 documents to the index. I start the transaction like this:

    org.infinispan.Cache cache = (org.infinispan.Cache) getCacheGetter().getTreeCache("Index");
    TransactionManager transactionManager = cache.getAdvancedCache().getTransactionManager();
    transactionManager.begin();

I open the index like this:

    Directory index = DirectoryBuilder.newDirectoryInstance(cache, cache, cache, cacheName).create();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer); // a StandardAnalyzer
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    IndexWriter indexWriter = new IndexWriter(index, config);

I add the documents in a loop:

    doc.add(new StringField("name", value, Field.Store.YES));
    indexWriter.addDocument(doc); // or updateDocument if the item already exists

    indexWriter.commit();
    indexWriter.close();

    transactionManager.commit();

When the system is under stress I get this exception:

    (Lucene Merge Thread #0) Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: Error loading metadata for index file: _9.cfs|M|Event

What am I doing wrong?
Michael McCandless (2015-08-03 04:47):
Also, I'm answering with respect to Java Lucene ... I'm not sure whether Lucene.Net has additional APIs here, but likely not (it tries to be a pure port, I think).

Michael McCandless (2015-08-03 04:46):
Sorry, that's the only way with Lucene's APIs. Even when you call IW.rollback on an already opened IW (to roll back all changes since the last commit), that also closes the IW.

Anonymous (2015-08-02 14:02):
OK, thanks for the answer. So this means that opening an IndexWriter is the only way to roll back to a commit point? Opening an IndexWriter just for the rollback is not ideal for me, since I have to do this in the application shutdown procedure: I would have to close the already opened writer, reopen it to roll back to a commit, and then close it again immediately. Would any other option work in my case, such as deleting a commit? (This is for Lucene.Net.)

Michael McCandless (2015-07-31 11:04):
Yes: use IndexWriterConfig.setIndexCommit and then open the IndexWriter. This will roll back to that commit point, removing any later commit points.
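That recipe can be sketched as follows under Lucene 4.10: the first writer preserves its commits with NoDeletionPolicy so there is an older commit to return to, then a second writer opened via setIndexCommit makes the rollback durable (field names are illustrative).

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoDeletionPolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RollbackToCommit {

    static void addDoc(IndexWriter w, String name) throws IOException {
        Document d = new Document();
        d.add(new StringField("name", name, Field.Store.YES));
        w.addDocument(d);
    }

    // Make two commits, then roll back to the first; returns the doc
    // count visible after the rollback.
    static int rollbackDemo() throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_4_10_4,
                new StandardAnalyzer(Version.LUCENE_4_10_4));
        cfg.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);  // keep all commits
        IndexWriter w = new IndexWriter(dir, cfg);
        addDoc(w, "saved");
        w.commit();            // commit #1: the "saved" state
        addDoc(w, "unsaved");
        w.commit();            // commit #2: the "unsaved" state
        w.close();

        // Open a new writer on the older commit; its next commit makes the
        // rollback durable, and the default deletion policy then removes
        // the newer commit.
        List<IndexCommit> commits = DirectoryReader.listCommits(dir);  // oldest first
        IndexWriterConfig cfg2 = new IndexWriterConfig(Version.LUCENE_4_10_4,
                new StandardAnalyzer(Version.LUCENE_4_10_4));
        cfg2.setIndexCommit(commits.get(0));
        IndexWriter w2 = new IndexWriter(dir, cfg2);
        w2.commit();
        w2.close();

        DirectoryReader r = DirectoryReader.open(dir);
        try {
            return r.numDocs();  // only the "saved" doc remains
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(rollbackDemo());
    }
}
```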
Anonymous (2015-07-31 08:44):
Very informative post! I have a situation with two commit points: one corresponding to saved application data and one corresponding to unsaved data. When I close the application, I want to delete (roll back) the data corresponding to the unsaved commit so that the unsaved data is no longer in the index. Is there a way to roll back or delete this data other than reopening the IndexWriter on the saved commit?

Michael McCandless (2013-11-15 12:42):
Maybe oal.index.TestStressIndexing2? It has a series of "verifyEquals" methods.
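There doesn't seem to be a turnkey differ beyond such test utilities. As a rough sketch of the general idea (mine, not the verifyEquals approach, and only at the stored-field level), assuming each document carries a unique stored "id" field, a diff reduces to two set differences:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexDiff {

    // Build a tiny index with one stored "id" field per document.
    static Directory index(String... ids) throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_4_10_4, new StandardAnalyzer(Version.LUCENE_4_10_4)));
        for (String id : ids) {
            Document d = new Document();
            d.add(new StringField("id", id, Field.Store.YES));
            w.addDocument(d);
        }
        w.close();
        return dir;
    }

    // Collect the stored "id" values (ignoring deletions, for brevity).
    static Set<String> ids(DirectoryReader r) throws IOException {
        Set<String> ids = new HashSet<String>();
        for (int i = 0; i < r.maxDoc(); i++) {
            ids.add(r.document(i).get("id"));
        }
        return ids;
    }

    // Report ids present in only one of the two indexes.
    static String diff() throws IOException {
        DirectoryReader a = DirectoryReader.open(index("1", "2", "3"));
        DirectoryReader b = DirectoryReader.open(index("2", "3", "4"));
        Set<String> added = ids(b);
        added.removeAll(ids(a));
        Set<String> removed = ids(a);
        removed.removeAll(ids(b));
        a.close();
        b.close();
        return "added=" + added + " removed=" + removed;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(diff());  // added=[4] removed=[1]
    }
}
```

A real differ would also compare field contents and postings for the ids present in both indexes, which is what the verifyEquals methods do.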
Anonymous (2013-11-14 04:28):
Hi Mike, can you point me to the test code you mentioned above? I've only managed to find the TestIndicesEquals [1] test case, but that is rather an index equality check. Is this the one you meant? Could you also give me some advice on how to implement such a differ in general? I'm eager to contribute it back to the community if my attempts end up as a useful feature.

Thanks in advance for your help!

Best,
Akos

[1]: http://svn.apache.org/repos/asf/lucene/dev/branches/preflexfixes/lucene/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java

Michael McCandless (2013-11-06 09:40):
I don't think there's any need for such a check in production code...

Anonymous (2013-11-06 08:45):
Sorry, I meant that the check provides no benefit (although it seems it may). Would you leave this type of check in production code, though, or would it prove expensive for a large index? Currently there is no check for specific documents, just a sanity check on the total doc count.

Thanks.
Michael McCandless (2013-11-06 06:52):
Hi Anonymous, what do you mean by "is futile"?

Once the commit succeeds, all the changes are on disk. But if you want to go further and verify that certain documents are present, that can still be worthwhile; e.g. it could catch bugs in your application where certain docs were not indexed before the commit.

Anonymous (2013-11-05 16:26):
Taking all this into account, am I right in saying that after performing a commit, a "verify success" step that opens the new index with an IndexReader and checks that numDocs matches what you expect is futile?

Anonymous (2013-10-06 05:37):
That's a brilliant idea. Thank you!

Michael McCandless (2013-10-03 07:24):
Thanks for opening that issue; I just added a comment!

Michael McCandless (2013-10-03 07:24):
Ahh, I see. So in that case the second user gets a "denial of service" because the first user is hogging the IndexWriter. I suppose you could do custom scheduling, i.e. stop the first user's indexing, close the IW, open a new IW (on the second user's latest commit point), index the second user's docs, close the IW, then open an IW on the first user's commit point and "resume" indexing the first user's docs. Tricky :)
Anonymous (2013-10-01 04:36):
First off, apologies for my late reply.

You're right, indexing is fast enough in general. But in my case it may happen that the changes made by one user require only a couple of document additions, deletions, and updates (which actually take only a couple of milliseconds), while the changes performed by another user (on a different branch, but at the same time) may require hundreds of thousands of document deletions and updates (which take seconds). Since there can be only one shared IW, the first user may have to wait seconds even though only a few modifications were made.

Anyway, thanks for your hints and tips, and for the great work you're doing for the community. Much appreciated.

Best,
Akos

Anonymous (2013-10-01 04:21):
I've opened the issue as you requested: https://issues.apache.org/jira/browse/LUCENE-5250

Michael McCandless (2013-09-19 17:22):
Also, it's a rather annoying limitation that you cannot open a specific IndexCommit with ReaderManager. Can you open a Jira issue? Maybe we can fix that ... thanks!
Michael McCandless (2013-09-19 17:13):
Right, only one IW at a time on a given index.

You can use addIndexes to add in all documents from an external index, but this does not "replace" the updated documents. You'd have to separately keep track of the deletions, and then addIndexes. But indexing is quite fast ... are you sure you need to optimize this?
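That "track the deletions separately, then addIndexes" recipe might look like this sketch (Lucene 4.x; the "id" field and the updated document are illustrative):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MergeSlave {

    static IndexWriterConfig cfg() {
        return new IndexWriterConfig(Version.LUCENE_4_10_4,
                new StandardAnalyzer(Version.LUCENE_4_10_4));
    }

    static void add(IndexWriter w, String id) throws IOException {
        Document d = new Document();
        d.add(new StringField("id", id, Field.Store.YES));
        w.addDocument(d);
    }

    // Master holds ids 1 and 2; the private "slave" index holds an updated
    // copy of 2 plus a new doc 3. Returns the merged doc count.
    static int mergedDocCount() throws IOException {
        Directory master = new RAMDirectory();
        Directory slave = new RAMDirectory();

        IndexWriter mw = new IndexWriter(master, cfg());
        add(mw, "1");
        add(mw, "2");
        mw.commit();

        IndexWriter sw = new IndexWriter(slave, cfg());
        add(sw, "2");  // updated copy of doc 2
        add(sw, "3");
        sw.close();

        // addIndexes does not replace updated docs, so re-apply the tracked
        // deletions on the master first, then pull the slave in wholesale.
        mw.deleteDocuments(new Term("id", "2"));
        mw.addIndexes(slave);
        mw.close();

        DirectoryReader r = DirectoryReader.open(master);
        try {
            return r.numDocs();  // 1, the updated 2, and 3
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(mergedDocCount());
    }
}
```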
Anonymous (2013-09-18 09:25):
Works like a charm. Great feature.

Am I correct in assuming that concurrent commits performed on different "branches" are not possible in my case, since only one IW can be open on the index?

If concurrent writing to an index is not possible, and I want to avoid synchronizing on the IW, I guess I have to open a new private "slave" index for each "branch" writer. After committing changes I would open a writer on the "master" directory and merge in the changes from the "slave". Is there a way to merge deletions as well?

Thanks in advance for your reply.

Best,
Akos
Anonymous (2013-09-17 13:26):
Hi Mike, first off, thank you for the quick reply.

I've implemented the IDP exactly as you suggested, and I open the readers on the HEAD commit points as well. To do this I extended ReferenceManager, as I was unable to find another way to open a reader on a particular IndexCommit via ReaderManager. I think the only difference is in the way I tried to use the IW. I'll check my test code ASAP and respond here if I find the solution.

Thanks again for the help.

Best,
Akos

Michael McCandless (2013-09-17 11:55):
I think you can index multiple branches into a single index; your IDP would have to track which commit points correspond to the head of which branch. But since you can only have one IW open at once, you'd have to open an IW on an old commit point, keep the other commit points that are heads of other branches, make all changes for this one branch, and close (deleting the previous head commit for the current branch).

For searching you can of course open multiple readers, one on each head commit point. I would open one reader first, then use openIfChanged from that one to get to the other head commit points.
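The searching side of this can be sketched as follows. Here NoDeletionPolicy stands in for a real IndexDeletionPolicy that would keep only each branch's head commit, and the "branch" field is illustrative:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoDeletionPolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ReadersPerBranch {

    // One commit per "branch", then one reader per surviving commit point.
    static int headReaderCount() throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_4_10_4,
                new StandardAnalyzer(Version.LUCENE_4_10_4));
        cfg.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);  // keep every commit
        IndexWriter w = new IndexWriter(dir, cfg);
        for (String branch : new String[] {"a", "b"}) {
            Document d = new Document();
            d.add(new StringField("branch", branch, Field.Store.YES));
            w.addDocument(d);
            w.commit();  // this commit becomes the branch's head
        }
        w.close();

        List<DirectoryReader> readers = new ArrayList<DirectoryReader>();
        for (IndexCommit c : DirectoryReader.listCommits(dir)) {
            readers.add(DirectoryReader.open(c));  // point-in-time view per head
        }
        int n = readers.size();
        for (DirectoryReader r : readers) {
            r.close();
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(headReaderCount());
    }
}
```

As Mike suggests, DirectoryReader.openIfChanged(reader, commit) can derive the other head views from the first reader, sharing segment readers instead of opening each view from scratch.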