Comments on Changing Bits: Catching slowdowns in Lucene
Michael McCandless (feed updated 2023-09-01T03:38:08.236-04:00)

Michael McCandless, 2017-11-01T06:16:59.232-04:00
Alas, no, not yet. Patches welcome! Python has the helpful psutil module that should make this straightforward.

Anonymous, 2017-11-01T05:03:26.125-04:00
Do you have graphs of CPU utilization and disk IOPS during tests?
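As a rough illustration of the psutil suggestion above, here is a minimal sketch of sampling CPU utilization and disk IOPS. It is not part of the benchmark code; the names `Sample`, `iops`, and `take_sample` are hypothetical, and psutil is a third-party package that must be installed separately.

```python
import time
from typing import NamedTuple


class Sample(NamedTuple):
    """One point on a CPU / disk-IOPS graph (hypothetical record type)."""
    cpu_percent: float  # system-wide CPU utilization over the interval
    read_iops: float    # completed disk reads per second
    write_iops: float   # completed disk writes per second


def iops(before_count: int, after_count: int, seconds: float) -> float:
    """Convert two cumulative I/O counters into operations per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (after_count - before_count) / seconds


def take_sample(interval: float = 1.0) -> Sample:
    """Measure CPU utilization and disk IOPS over `interval` seconds."""
    # Imported lazily so the pure arithmetic above works without psutil.
    import psutil  # third-party: pip install psutil

    # Cumulative counters since boot; may be None on a diskless system,
    # which this sketch does not handle.
    before = psutil.disk_io_counters()
    start = time.monotonic()
    # With a positive interval, cpu_percent blocks for that long and
    # returns the utilization measured across it.
    cpu = psutil.cpu_percent(interval=interval)
    elapsed = time.monotonic() - start
    after = psutil.disk_io_counters()
    return Sample(
        cpu_percent=cpu,
        read_iops=iops(before.read_count, after.read_count, elapsed),
        write_iops=iops(before.write_count, after.write_count, elapsed),
    )
```

Calling `take_sample()` in a loop while a benchmark runs, and writing each `Sample` to a file, would give the raw series for such graphs. Running it in a separate process keeps the sampling overhead out of the measured JVM.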
Michael McCandless, 2014-07-06T05:45:05.716-04:00
Hi Gili,

No, each query type is wildly different from the other query types so you really cannot compare them. You can only compare a query type with itself from different days ...

An in-memory index/codec format will affect different queries differently. E.g. the switch to MemoryPF for the "id" field was a big speedup...

Gili Nachum, 2014-07-06T03:16:37.612-04:00
Hi Mike, I'm trying to read the benchmark as a way to learn the relative cost of different queries. Are the different query results comparable to each other?

They seem a bit counterintuitive to me: Wildcard query (15 QPS) is just 2x slower than Term query (30 QPS). FuzzyQuery (edit distance 2) is faster than both (40 QPS). Primary key lookup is in another sphere altogether (800 KQPS).
Perhaps QPS is very close, as I/O is a bottleneck, and in a memory resident index they would be very different?

Anonymous, 2014-04-28T04:55:53.011-04:00
We just have, for every term, a posting list where the term occurs.

Anonymous, 2014-04-28T00:50:21.627-04:00
Replied. Name of thread is Posting list.

Michael McCandless, 2014-04-24T09:12:02.501-04:00
Hi, maybe you can ask this on Lucene's dev list? (dev@lucene.apache.org).

Anonymous, 2014-04-22T08:03:01.278-04:00
Can you explain how the dictionary is linked to this implementation of posting lists? In the traditional case we have a dictionary like hashmap[String,List(int,int)] //word -> docid, termfreq. In this case the dictionary points to "parallel arrays" slots, and the "pointer array" points to the most recent docid in the posting list. What does it mean "to search the posting list"? In other words, how does this map to the List(int,int) part?

Michael McCandless, 2014-01-13T13:35:10.889-05:00
Yes, just open an IndexWriter on the index and add the new documents to it, delete old documents, etc.

mugeesh, 2014-01-13T12:42:06.578-05:00
Hi Sir, I have a large database (1 TB) that I want to index using Lucene 4.0. Later I want to re-index, or update the index with changed values. If I delete the whole index and rebuild it from scratch, it may take a huge amount of time. Please sir,
Is there any way to update the index incrementally?
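For readers following the posting-list question in this thread, the "traditional" dictionary it describes, hashmap[String,List(int,int)] mapping a word to (docid, termfreq) pairs, can be sketched as below. This is only an illustration of that textbook structure, not Lucene's or any other engine's actual implementation; the function names and the whitespace tokenizer are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[int, int]  # (docid, term frequency in that document)


def build_index(docs: List[str]) -> Dict[str, List[Posting]]:
    """Map each term to its posting list, ordered by ascending docid."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for docid, text in enumerate(docs):
        # Naive lowercase + whitespace tokenization (an assumption);
        # a real analyzer does far more.
        for term, freq in Counter(text.lower().split()).items():
            index[term].append((docid, freq))
    return dict(index)


def search(index: Dict[str, List[Posting]], term: str) -> List[Posting]:
    """'Searching the posting list': look the term up in the dictionary
    and return every (docid, termfreq) pair recorded for it."""
    return index.get(term.lower(), [])
```

With `docs = ["lucene is fast", "fast fast search", "lucene search"]`, the term "fast" occurs once in document 0 and twice in document 1, so its posting list holds those two pairs.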
Michael McCandless, 2013-08-26T08:09:29.827-04:00
Hi Anonymous,

That address is correct; try again later / from a different network?

Anonymous, 2013-08-23T19:12:34.042-04:00
Awesome benchmark program, however I've been unable to get the data files:
--2013-08-23 15:52:53-- http://people.apache.org/~mikemccand/enwiki-20120502-lines-1k.txt.lzma
Resolving people.apache.org (people.apache.org)... 140.211.11.9
Connecting to people.apache.org (people.apache.org)|140.211.11.9|:80... failed:
Connection timed out.
Retrying.

Correct address? Will try later... Thanks!

Anonymous, 2013-07-18T10:28:40.268-04:00
Yes, I see. Thank you for the explanation provided.

Michael McCandless, 2013-07-18T10:20:40.284-04:00
The speedup from multiple threads depends entirely on how concurrent your hardware is; I'd suggest at most 2*number-of-CPU-cores search threads. If your hardware has 10-fold concurrency (CPU and IO) then yes you should hit 300 QPS with 10 search threads.

For a RAM resident index, it's best to use MMapDirectory and let the OS manage the RAM; if there is plenty of free RAM for it (i.e., you keep your JVM heap sizes low) then it will hold the entire index (or at least the "hot" parts) in RAM. The speedup of a hot index over a cold index is enormous in many cases, because seeking is exceptionally costly for spinning-magnet disks and still costly even for SSDs.

Anonymous, 2013-07-18T09:52:26.275-04:00
Oh, and what would the relative performance be when serving from a RAM resident index?

Anonymous, 2013-07-18T09:47:28.396-04:00
Thanks for clearing that up. What is the common sweet spot for search threads? If I am using 10 threads (assuming I have tuned Lucene right), can I expect 300 QPS? At how many threads does performance begin to suffer, or is it app specific?

Michael McCandless, 2013-07-18T09:36:48.573-04:00
OK I see, and thanks for sharing the link to the Earlybird paper.

I'm not familiar with Earlybird's design, but scanning the paper it's clearly been heavily customized to match Twitter's specific needs (e.g. encoding a position in 8 bits since a tweet is at most 140 chars). It's also fully RAM resident, I think, and the search can terminate early since it's always sorted in reverse chronological order.

Vs the 30 QPS number which is a general-case search on larger docs, using a single thread.

Anonymous, 2013-07-18T09:31:39.666-04:00
Or is it just that the index is in RAM?

Anonymous, 2013-07-18T09:00:49.735-04:00
I was referring to roughly 30 QPS. I agree that their situation involves searching only a few things (tweet text and usernames), but how do they manage to get 5000 QPS, or am I wrong?
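The thread-count advice earlier in this thread (at most 2*number-of-CPU-cores search threads, and 10 threads on 10-fold-concurrent hardware turning roughly 30 QPS into 300 QPS) amounts to simple arithmetic, sketched below. The linear-scaling model is a best-case assumption made for illustration, and the function names are hypothetical; real throughput depends on the index, the queries, and I/O.

```python
import os


def max_search_threads(cpu_cores: int) -> int:
    """Upper bound suggested in the thread: 2 * number of CPU cores."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return 2 * cpu_cores


def estimated_qps(single_thread_qps: float, threads: int,
                  hardware_concurrency: int) -> float:
    """Best-case throughput under an assumed linear-scaling model:
    each thread adds full single-thread QPS until the hardware's
    concurrency (CPU and IO) is used up; extra threads add nothing."""
    effective_threads = min(threads, hardware_concurrency)
    return single_thread_qps * effective_threads


def default_thread_cap() -> int:
    """Apply the 2x rule to this machine. os.cpu_count() reports
    logical CPUs and may return None, so fall back to 1."""
    return max_search_threads(os.cpu_count() or 1)
```

Under this model, 10 threads on hardware with only 4-fold concurrency would top out at about 4 times the single-thread rate, which is why adding threads beyond what the hardware can run concurrently does not keep raising QPS.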
Anonymous, 2013-07-18T08:54:21.764-04:00
Section VII Deployment and Performance

http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf

Michael McCandless, 2013-07-18T08:31:53.871-04:00
Hi Anonymous,

I don't understand the numbers you're referring to; can you share the documents from Twitter? Which benchmark of mine are you getting 30 ms latency from? And is 5000 RPS queries per second (search time) or tweets/second (indexing time)?

Anonymous, 2013-07-17T17:44:21.390-04:00
Reading documents from Twitter, they claim that a fully loaded machine with 144M tweets can handle 5000 RPS with a latency of 150 ms. Looking at your benchmark tests, they do around 30. I am sure that I am missing something (like I am comparing apples to oranges), so I will ask for some explanation, please.

Michael McCandless, 2012-05-15T06:39:13.571-04:00
Hi Anonymous,

The server has 2 Xeon X5680s so a total of 24 cores (6 cores/cpu X 2 cpus X 2 for hyperthreading), overclocked to 4.0 GHz. 12 GB of RAM. OS is Fedora 13. Index is written to a 240 GB OCZ Vertex 3, and content is read from a separate spinning-magnets hard drive.

Anonymous, 2012-05-14T20:28:40.867-04:00
May I ask what is the hardware specification of your testing server?

Michael McCandless, 2011-05-14T18:07:54.304-04:00
It would be great to have similar tests for Solr so we could catch Solr-specific slowdowns.

It's certainly doable (it's just software!), but I don't think I'm going to have time near term to build this out.

I would really love to see us get there though...

Patches welcome ;)