<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8623074010562846957</id><updated>2012-02-06T11:49:12.808-05:00</updated><category term='Python'/><category term='Kids'/><category term='OpenSolaris'/><category term='Compact Language Detector'/><category term='Lucene'/><category term='Health'/><category term='ZFS'/><category term='Home automation'/><title type='text'>Changing Bits</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>83</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5011616579658774614</id><published>2012-01-14T11:19:00.000-05:00</published><updated>2012-01-14T11:19:13.107-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title 
type='text'>ToChildBlockJoinQuery in Lucene</title><content type='html'>In my &lt;a href="http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html"&gt;last post&lt;/a&gt; I described a known limitation of &lt;code&gt;BlockJoinQuery&lt;/code&gt;: it joins in only one direction (from child to parent documents).  This can be a problem because some applications need to join in reverse (from parent to child documents) instead.&lt;br&gt;&lt;br&gt;This is now fixed!  I just &lt;a href="https://issues.apache.org/jira/browse/LUCENE-3685"&gt;committed&lt;/a&gt; a new query, &lt;code&gt;ToChildBlockJoinQuery&lt;/code&gt;, to perform the join in the opposite direction.  I also renamed the previous query to &lt;code&gt;ToParentBlockJoinQuery&lt;/code&gt;.&lt;br&gt;&lt;br&gt;You use it just like &lt;code&gt;BlockJoinQuery&lt;/code&gt;, except in reverse: it wraps any other &lt;code&gt;Query&lt;/code&gt; matching parent documents and translates it into a &lt;code&gt;Query&lt;/code&gt; matching child documents.  The resulting &lt;code&gt;Query&lt;/code&gt; can then be combined with other queries against fields in the child documents, and you can then sort by child fields as well.&lt;br&gt;&lt;br&gt; Using songs and albums as an example: imagine you index each song (child) and album (parent) as separate documents in a single document block.  
With &lt;code&gt;ToChildBlockJoinQuery&lt;/code&gt;, you can now run queries like:&lt;pre&gt;&lt;br /&gt;  albumName:thunder AND songName:numb&lt;br /&gt;&lt;/pre&gt;or&lt;pre&gt;&lt;br /&gt;  albumName:thunder, sort by songTitle&lt;br /&gt;&lt;/pre&gt;Any query with constraints against album and/or song fields will work, and the returned hits will be individual songs (not grouped).&lt;br&gt;&lt;br&gt;&lt;code&gt;ToChildBlockJoinQuery&lt;/code&gt; will be available in Lucene 3.6.0 and 4.0.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5011616579658774614?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5011616579658774614/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2012/01/tochildblockjoinquery-in-lucene.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5011616579658774614'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5011616579658774614'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2012/01/tochildblockjoinquery-in-lucene.html' title='ToChildBlockJoinQuery in Lucene'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-6483188241817199309</id><published>2012-01-08T18:52:00.000-05:00</published><updated>2012-01-08T18:52:36.503-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Searching relational content with Lucene's BlockJoinQuery</title><content type='html'>&lt;a href="http://lucene.apache.org/java"&gt;Lucene's&lt;/a&gt; &lt;a href="http://lucene.apache.org/java/docs/index.html#14+September+2011+-+Lucene+Core+3.4.0"&gt;3.4.0 release&lt;/a&gt; adds a new feature called &lt;em&gt;index-time join&lt;/em&gt; (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of &lt;a href="http://en.wikipedia.org/wiki/Relational_model"&gt;relational content&lt;/a&gt;.&lt;br&gt;&lt;br&gt; Most search engines can't directly index relational content, as documents in the index logically behave like a single flat database table.  Yet, relational content is everywhere!  A job listing site has each company joined to the specific listings for that company.  Each resume might have a separate list of skills, education and past work experience.  A music search engine has an artist/band joined to albums and then joined to songs.  A source code search engine would have projects joined to modules and then files.&lt;br&gt;&lt;br&gt;Perhaps the PDF documents you need to search are immense, so you break them up and index each section as a separate Lucene document; in this case you'll have common fields (title, abstract, author, date published, etc.) for the overall document, joined to the sub-document (section) with its own fields (text, page number, etc.).  
XML documents typically contain nested tags, representing joined sub-documents; emails have attachments; office documents can embed other documents.  Nearly all search domains have some form of relational content, often requiring more than one join.&lt;br&gt;&lt;br&gt;If such content is so common then how do search applications handle it today?&lt;br&gt;&lt;br&gt;One obvious "solution" is to simply use a relational database instead of a search engine!  If relevance scores are less important and you need to do substantial joining, grouping, sorting, etc., then using a database could be best overall.  Most databases include some form of text search, some even using Lucene.&lt;br&gt;&lt;br&gt;If you still want to use a search engine, then one common approach is to &lt;a href="http://en.wikipedia.org/wiki/Denormalization"&gt;&lt;em&gt;denormalize&lt;/em&gt;&lt;/a&gt; the content up front, at index-time, by joining all tables and indexing the resulting rows, duplicating content in the process.  For example, you'd index each song as a Lucene document, copying over all fields from the song's joined album and artist/band.  This works correctly, but can be horribly wasteful as you are indexing identical fields, possibly including large text fields, over and over.&lt;br&gt;&lt;br&gt;Another approach is to do the join yourself, outside of Lucene, by indexing songs, albums and artist/band as separate Lucene documents, perhaps even in separate indices.  At search-time, you first run a query against one collection, for example the songs.  
Then you iterate through &lt;b&gt;all&lt;/b&gt; hits, gathering up (joining) the full set of corresponding albums and then run a second query against the albums, with a large OR'd list of the albums from the first query, repeating this process if you need to join to artist/band as well. This approach will also work, but doesn't scale well as you may have to create possibly immense follow-on queries.&lt;br&gt;&lt;br&gt;Yet another approach is to use a software package that has already implemented one of these approaches for you!  &lt;a href="http://www.elasticsearch.org/"&gt;elasticsearch&lt;/a&gt;, &lt;a href="http://lucene.apache.org/solr/"&gt;Apache Solr&lt;/a&gt;, &lt;a href="http://jackrabbit.apache.org/"&gt;Apache Jackrabbit&lt;/a&gt;, &lt;a href="http://www.hibernate.org/subprojects/search.html"&gt;Hibernate Search&lt;/a&gt; and many others all handle relational content in some way.&lt;br&gt;&lt;br&gt;With &lt;code&gt;BlockJoinQuery&lt;/code&gt; you can now directly search relational content yourself!&lt;br&gt;&lt;br&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://www.amazon.com/Mountain-Three-Wolf-Short-Sleeve/dp/B002HJ377A" style="clear:right; float:right;"&gt;&lt;img border="0" height="358" width="400" src="http://1.bp.blogspot.com/-OCajpCiqBNA/TwmzR95meRI/AAAAAAAAAKs/bSZ2qZMj1F8/s400/threeWolf.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;Let's work through a simple example: imagine you sell shirts online. Each shirt has certain common fields such as name, description, fabric, price, etc.  For each shirt you have a number of separate &lt;a href="http://en.wikipedia.org/wiki/Stock-keeping_unit"&gt;stock keeping units&lt;/a&gt; or SKUs, which have their own fields like size, color, inventory count, etc.  
The SKUs are what you actually sell, and what you must stock, because when someone buys a shirt they buy a specific SKU (size and color).&lt;br&gt;&lt;br&gt;Maybe you are &lt;a href="http://news.bbc.co.uk/2/hi/8061031.stm"&gt;lucky enough&lt;/a&gt; to sell the incredible &lt;a href="http://www.amazon.com/Mountain-Three-Wolf-Short-Sleeve/dp/B002HJ377A"&gt;Mountain Three-wolf Moon Short Sleeve Tee&lt;/a&gt;, with these SKUs (size, color):&lt;ul&gt;  &lt;li&gt; small, blue  &lt;li&gt; small, black  &lt;li&gt; medium, black  &lt;li&gt; large, gray&lt;/ul&gt;Perhaps a user first searches for "wolf shirt", gets a bunch of hits, and then drills down on a particular size and color, resulting in this query:&lt;pre&gt;&lt;br /&gt;   name:wolf AND size=small AND color=blue&lt;br /&gt;&lt;/pre&gt;which should match this shirt. &lt;code&gt;name&lt;/code&gt; is a shirt field while the &lt;code&gt;size&lt;/code&gt; and &lt;code&gt;color&lt;/code&gt; are SKU fields.&lt;br&gt;&lt;br&gt;But if the user drills down instead on a small gray shirt:&lt;pre&gt;&lt;br /&gt;   name:wolf AND size=small AND color=gray&lt;br /&gt;&lt;/pre&gt;then this shirt should not match because the small size only comes in blue and black.&lt;br&gt;&lt;br&gt; How can you run these queries using &lt;code&gt;BlockJoinQuery&lt;/code&gt;?  Start by indexing each shirt (parent) and all of its SKUs (children) as separate documents, using the new &lt;code&gt;IndexWriter.addDocuments&lt;/code&gt; API to add one shirt and all of its SKUs as a single &lt;em&gt;document block&lt;/em&gt;.  This method atomically adds a block of documents into a single segment as adjacent document IDs, which &lt;code&gt;BlockJoinQuery&lt;/code&gt; relies on. You should also add a marker field to each shirt document (e.g. 
&lt;code&gt;type = shirt&lt;/code&gt;), as &lt;code&gt;BlockJoinQuery&lt;/code&gt; requires a &lt;code&gt;Filter&lt;/code&gt; identifying the parent documents.&lt;br&gt;&lt;br&gt;To run a &lt;code&gt;BlockJoinQuery&lt;/code&gt; at search-time, you'll first need to create the &lt;em&gt;parent filter&lt;/em&gt;, matching only shirts. Note that the filter must use &lt;code&gt;FixedBitSet&lt;/code&gt; under the hood, like &lt;code&gt;CachingWrapperFilter&lt;/code&gt;:&lt;pre&gt;&lt;br /&gt;  Filter shirts = new CachingWrapperFilter(&lt;br /&gt;                    new QueryWrapperFilter(&lt;br /&gt;                      new TermQuery(&lt;br /&gt;                        new Term("type", "shirt"))));&lt;br /&gt;&lt;/pre&gt;Create this filter once, up front and re-use it any time you need to perform this join.&lt;br&gt;&lt;br&gt;Then, for each query that requires a join, because it involves both SKU and shirt fields, start with the child query matching only SKU fields:&lt;pre&gt;&lt;br /&gt;  BooleanQuery skuQuery = new BooleanQuery();&lt;br /&gt;  skuQuery.add(new TermQuery(new Term("size", "small")), Occur.MUST);&lt;br /&gt;  skuQuery.add(new TermQuery(new Term("color", "blue")), Occur.MUST);&lt;br /&gt;&lt;/pre&gt;Next, use &lt;code&gt;BlockJoinQuery&lt;/code&gt; to translate hits from the SKU document space up to the shirt document space:&lt;pre&gt;&lt;br /&gt;  BlockJoinQuery skuJoinQuery = new BlockJoinQuery(&lt;br /&gt;    skuQuery, &lt;br /&gt;    shirts,&lt;br /&gt;    ScoreMode.None);&lt;br /&gt;&lt;/pre&gt;The &lt;code&gt;ScoreMode&lt;/code&gt; enum decides how scores for multiple SKU hits should be aggregated to the score for the corresponding shirt hit.  
In this query you don't need scores from the SKU matches, but if you did you could aggregate with &lt;code&gt;Avg&lt;/code&gt;, &lt;code&gt;Max&lt;/code&gt; or &lt;code&gt;Total&lt;/code&gt; instead.&lt;br&gt;&lt;br&gt;Finally you are now free to build up an arbitrary shirt query using &lt;code&gt;skuJoinQuery&lt;/code&gt; as a clause:&lt;pre&gt;&lt;br /&gt;  BooleanQuery query = new BooleanQuery();&lt;br /&gt;  query.add(new TermQuery(new Term("name", "wolf")), Occur.MUST);&lt;br /&gt;  query.add(skuJoinQuery, Occur.MUST);&lt;br /&gt;&lt;/pre&gt;You could also just run &lt;code&gt;skuJoinQuery&lt;/code&gt; as-is if the query doesn't have any shirt fields.&lt;br&gt;&lt;br&gt;Finally, just run this &lt;code&gt;query&lt;/code&gt; like normal!  The returned hits will be only shirt documents; if you'd also like to see which SKUs matched for each shirt, use &lt;code&gt;BlockJoinCollector&lt;/code&gt;:&lt;pre&gt;&lt;br /&gt;  BlockJoinCollector c = new BlockJoinCollector(&lt;br /&gt;    Sort.RELEVANCE, // sort&lt;br /&gt;    10,             // numHits&lt;br /&gt;    true,           // trackScores&lt;br /&gt;    false           // trackMaxScore&lt;br /&gt;    );&lt;br /&gt;  searcher.search(query, c);&lt;br /&gt;&lt;/pre&gt;The provided &lt;code&gt;Sort&lt;/code&gt; must use only shirt fields (you cannot sort by any SKU fields).  When each hit (a shirt) is competitive, this collector will also record all SKUs that matched for that shirt, which you can retrieve like this:&lt;pre&gt;&lt;br /&gt;  TopGroups&amp;lt;Integer&amp;gt; hits = c.getTopGroups(&lt;br /&gt;    skuJoinQuery,&lt;br /&gt;    skuSort,&lt;br /&gt;    0,   // offset&lt;br /&gt;    10,  // maxDocsPerGroup&lt;br /&gt;    0,   // withinGroupOffset&lt;br /&gt;    true // fillSortFields&lt;br /&gt;  );&lt;br /&gt;&lt;/pre&gt;Set &lt;code&gt;skuSort&lt;/code&gt; to the sort order for the SKUs within each shirt.  The first &lt;code&gt;offset&lt;/code&gt; hits are skipped (use this for paging through shirt hits).  
Under each shirt, at most &lt;code&gt;maxDocsPerGroup&lt;/code&gt; SKUs will be returned. Use &lt;code&gt;withinGroupOffset&lt;/code&gt; if you want to page within the SKUs.  If &lt;code&gt;fillSortFields&lt;/code&gt; is true then each SKU hit will have values for the fields from &lt;code&gt;skuSort&lt;/code&gt;.&lt;br&gt;&lt;br&gt;The hits returned by &lt;code&gt;BlockJoinCollector.getTopGroups&lt;/code&gt; are SKU hits, grouped by shirt.  You'd get the exact same results if you had denormalized up-front and then used grouping to group results by shirt.&lt;br&gt;&lt;br&gt;You can also do more than one join in a single query; the joins can be nested (parent to child to grandchild) or parallel (parent to child1 and parent to child2).&lt;br&gt;&lt;br&gt;However, there are some important limitations of index-time joins:&lt;ul&gt;  &lt;li&gt; The join must be computed at index-time and "compiled" into the    index, in that all joined child documents must be indexed along    with the parent document, as a single document block.    &lt;br&gt;    &lt;br&gt;  &lt;li&gt; Different document types (for example, shirts and SKUs) must    share a single index, which is wasteful as it means non-sparse    data structures like &lt;code&gt;FieldCache&lt;/code&gt; entries consume more    memory than they would if you had separate indices.    &lt;br&gt;    &lt;br&gt;  &lt;li&gt; If you need to re-index a parent document or any of its child    documents, or delete or add a child, then the entire block must be    re-indexed.  This is a big problem in some cases, for example if    you index "user reviews" as child documents then whenever a user    adds a review you'll have to re-index that shirt as well as all    its SKUs and user reviews.    &lt;br&gt;    &lt;br&gt;  &lt;li&gt; There is no &lt;code&gt;QueryParser&lt;/code&gt; support, so you need to    programmatically create the parent and child queries,    separating according to parent and child fields.    
&lt;br&gt;    &lt;br&gt;  &lt;li&gt; The join can currently only go in one direction (mapping child    docIDs to parent docIDs), but in some cases you need to map parent    docIDs to child docIDs.  For example, when searching songs,    perhaps you want all matching songs sorted by their title.  You    can't easily do this today because the only way to get song hits    is to group by album or band/artist.    &lt;br&gt;    &lt;br&gt;  &lt;li&gt; The join is a one (parent) to many (children), inner join.&lt;/ul&gt;As usual, patches are welcome!&lt;br&gt;&lt;br&gt;There is &lt;a href="https://issues.apache.org/jira/browse/LUCENE-3602"&gt;work underway&lt;/a&gt; to create a more flexible, but likely less performant, query-time join capability, which should address a number of the above limitations.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-6483188241817199309?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/6483188241817199309/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6483188241817199309'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6483188241817199309'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html' title='Searching relational content with Lucene&apos;s BlockJoinQuery'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' 
height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-OCajpCiqBNA/TwmzR95meRI/AAAAAAAAAKs/bSZ2qZMj1F8/s72-c/threeWolf.jpg' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-3558948251715770535</id><published>2011-11-10T06:32:00.000-05:00</published><updated>2011-11-10T06:32:22.728-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>SearcherLifetimeManager prevents a broken search user experience</title><content type='html'>&lt;br/&gt;&lt;br/&gt;In the past, search indices were usually very static: you built them once, called &lt;code&gt;optimize&lt;/code&gt; at the end and shipped them off, and didn't change them very often.&lt;br/&gt;&lt;br/&gt;But these days it's just the opposite: most applications have very dynamic indices, constantly being updated with a stream of changes, and you &lt;em&gt;never&lt;/em&gt; call &lt;code&gt;optimize&lt;/code&gt; anymore.&lt;br/&gt;&lt;br/&gt;Lucene's near-real-time search, especially with recent improvements including &lt;a href="http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html"&gt;manager classes&lt;/a&gt; to handle the tricky complexities of sharing searchers across threads, offers very fast search turnaround on index changes.&lt;br/&gt;&lt;br/&gt;But there is a serious yet often overlooked problem with this approach.  To see it, you have to put yourself in the shoes of a user. Imagine Alice comes to your site, runs a search, and is looking through the search results.  Not satisfied, after a few seconds she decides to refine that first search.  Perhaps she drills down on one of the nice facets you presented, or maybe she clicks to the next page, or picks a different sort criterion (any follow-on action will do).  
So a new search request is sent back to your server, including the first search plus the requested change (drill down, next page, change sort field, etc.).&lt;br/&gt;&lt;br/&gt;How do you handle this follow-on search request?  Just pull the latest and greatest searcher from your &lt;a href="http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html"&gt;&lt;code&gt;SearcherManager&lt;/code&gt;&lt;/a&gt; or &lt;a href="http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html"&gt;&lt;code&gt;NRTManager&lt;/code&gt;&lt;/a&gt; and search away, right?&lt;br/&gt;&lt;br/&gt;Wrong!&lt;br/&gt;&lt;br/&gt;If you do this, you risk a broken search experience for Alice, because the new searcher may be different from the original searcher used for Alice's first search request.  The differences could be substantial, if you had just opened a new searcher after updating a bunch of documents.  This means the results of Alice's follow-on search may have shifted: facet counts are now off, hits are sorted differently so some hits may be duplicated on the second page, or may be lost (if they moved from page 2 to page 1), etc.  If you use the new (will be in Lucene 3.5.0) &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2215"&gt;&lt;code&gt;searchAfter&lt;/code&gt;&lt;/a&gt; API, for efficient paging, the risk is even greater!&lt;br/&gt;&lt;br/&gt;Perversely, the frequent searcher reopening that you thought provides such a great user experience by making all search results so fresh, can in fact have just the opposite effect.  Each reopen risks breaking all current searches in your application; the more active your site, the more searches you might break!&lt;br/&gt;&lt;br/&gt;It's deadly to intentionally break a user's search experience: they will (correctly) conclude your search is buggy, eroding their trust, and then take their business to your competition.&lt;br/&gt;&lt;br/&gt;It turns out, this is easy to fix!  
Instead of pulling the latest searcher for every incoming search request, you should try to pull the same searcher used for the initial search request in the session. This way all follow-on searches see exactly the same index.&lt;br/&gt;&lt;br/&gt;Fortunately, there's a new class coming in Lucene 3.5.0 that simplifies this: &lt;code&gt;SearcherLifetimeManager&lt;/code&gt;.  The class is agnostic to how you obtain the fresh searchers (i.e., &lt;code&gt;SearcherManager&lt;/code&gt;, &lt;code&gt;NRTManager&lt;/code&gt;, or your own custom source) used for an initial search. Just like Lucene's &lt;a href="http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html"&gt;other manager classes&lt;/a&gt;, &lt;code&gt;SearcherLifetimeManager&lt;/code&gt; is very easy to use.  Create the manager once, up front:&lt;pre&gt;&lt;br /&gt;  SearcherLifetimeManager mgr = new SearcherLifetimeManager();&lt;br /&gt;&lt;/pre&gt;Then, when a search request arrives, if it's an initial (not follow-on) search, obtain the most current searcher in &lt;a href="http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html"&gt;the usual way&lt;/a&gt;, but then record this searcher:&lt;pre&gt;&lt;br /&gt;  long token = mgr.record(searcher);&lt;br /&gt;&lt;/pre&gt;The returned &lt;code&gt;token&lt;/code&gt; uniquely identifies the specific searcher; you must save it somewhere in the user's search results, for example by placing it in a hidden HTML form field.&lt;br/&gt;&lt;br/&gt;Later, when the user performs a follow-on search request, make sure the original &lt;code&gt;token&lt;/code&gt; is sent back to the server, and then use it to obtain the same searcher:&lt;pre&gt;&lt;br /&gt;  // If possible, obtain same searcher version as last&lt;br /&gt;  // search:&lt;br /&gt;  IndexSearcher searcher = mgr.acquire(token);&lt;br /&gt;  if (searcher != null) {&lt;br /&gt;    // Searcher is still here&lt;br /&gt;    try {&lt;br /&gt;      // do searching...&lt;br /&gt;    } finally 
{&lt;br /&gt;      mgr.release(searcher);&lt;br /&gt;      // Do not use searcher after this!&lt;br /&gt;      searcher = null;&lt;br /&gt;    }&lt;br /&gt;  } else {&lt;br /&gt;    // Searcher was pruned -- notify user session timed&lt;br /&gt;    // out&lt;br /&gt;  }&lt;br /&gt;&lt;/pre&gt;As long as the original searcher is still available, the manager will return it to you; be sure to &lt;code&gt;release&lt;/code&gt; that searcher (ideally in a &lt;code&gt;finally&lt;/code&gt; clause).&lt;br/&gt;&lt;br/&gt;It's possible the searcher is no longer available: for example if Alice ran a new search, but then got hungry, went off to a long lunch, and finally returned and then clicked "next page", likely the original searcher will have been pruned!&lt;br/&gt;&lt;br/&gt;You should gracefully handle this case, for example by notifying Alice that the search had timed out and asking her to re-submit the original search (which will then get the latest and greatest searcher). Fortunately, you can reduce how often this happens, by controlling how aggressively you prune old searchers:&lt;pre&gt;&lt;br /&gt;  mgr.prune(new PruneByAge(600.0));&lt;br /&gt;&lt;/pre&gt;This removes any searchers older than 10 minutes (you can also implement a custom pruning strategy).  You should call it from a separate dedicated thread (not a searcher thread), ideally the same thread that's periodically indexing changes and opening new searchers.&lt;br/&gt;&lt;br/&gt;Keeping many searchers around will necessarily tie up resources (open file descriptors, RAM, index files on disk that the &lt;code&gt;IndexWriter&lt;/code&gt; would otherwise have deleted).  However, because the reopened searchers share sub-readers, the resource consumption will generally be well contained, in proportion to how many index changes occurred between each reopen.  
Just be sure to use &lt;code&gt;NRTCachingDirectory&lt;/code&gt;, to ensure you don't bump up against open file descriptor limits on your operating system (this also gives a good speedup in reopen turnaround time).&lt;br/&gt;&lt;br/&gt;Don't erode your users' trust by intentionally breaking their searches!&lt;br/&gt;&lt;br/&gt;&lt;a href="https://issues.apache.org/jira/browse/LUCENE-3486"&gt;LUCENE-3486&lt;/a&gt; has the details.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-3558948251715770535?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/3558948251715770535/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/11/searcherlifetimemanager-prevents-broken.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3558948251715770535'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3558948251715770535'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/11/searcherlifetimemanager-prevents-broken.html' title='SearcherLifetimeManager prevents a broken search user experience'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5694774165455660309</id><published>2011-11-03T14:12:00.000-04:00</published><updated>2011-11-03T14:12:06.087-04:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Near-real-time readers with Lucene's SearcherManager and NRTManager</title><content type='html'>&lt;a href="http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html"&gt;Last time&lt;/a&gt;, I described the useful &lt;code&gt;SearcherManager&lt;/code&gt; class, coming in the next (3.5.0) Lucene release, to periodically reopen your &lt;code&gt;IndexSearcher&lt;/code&gt; when multiple threads need to share it. This class presents a very simple &lt;code&gt;acquire&lt;/code&gt;/&lt;code&gt;release&lt;/code&gt; API, hiding the thread-safe complexities of opening and closing the underlying &lt;code&gt;IndexReader&lt;/code&gt;s.&lt;br&gt;&lt;br&gt;But that example used a non near-real-time (NRT) &lt;code&gt;IndexReader&lt;/code&gt;, which has relatively high turnaround time for index changes to become visible, since you must call &lt;code&gt;IndexWriter.commit&lt;/code&gt; first.&lt;br&gt;&lt;br&gt;If you have access to the &lt;code&gt;IndexWriter&lt;/code&gt; that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead!  
NRT readers let you decouple &lt;em&gt;durability&lt;/em&gt; to hardware/OS crashes from &lt;em&gt;visibility&lt;/em&gt; of changes to a new &lt;code&gt;IndexReader&lt;/code&gt;. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This &lt;em&gt;controlled consistency model&lt;/em&gt; that Lucene exposes is a nice "best of both worlds" blend between the traditional &lt;a href="http://en.wikipedia.org/wiki/Immediate_consistency"&gt;immediate&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Eventual_consistency"&gt;eventual&lt;/a&gt; consistency models.&lt;br&gt;&lt;br&gt;Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides &lt;a href="http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html"&gt;extremely fast turnaround time&lt;/a&gt; on making index changes visible to searchers. Frequent reopens, such as every 50 milliseconds, even under relatively high indexing rates, are easily achievable on modern hardware.&lt;br&gt;&lt;br&gt;Fortunately, it's trivial to use &lt;code&gt;SearcherManager&lt;/code&gt; with NRT readers: use the constructor that takes &lt;code&gt;IndexWriter&lt;/code&gt; instead of &lt;code&gt;Directory&lt;/code&gt;:&lt;pre&gt;&lt;br /&gt;  boolean applyAllDeletes = true;&lt;br /&gt;  ExecutorService es = null;&lt;br /&gt;  SearcherManager mgr = new SearcherManager(writer, applyAllDeletes,&lt;br /&gt;                                            new MySearchWarmer(), es);&lt;br /&gt;&lt;/pre&gt;This tells &lt;code&gt;SearcherManager&lt;/code&gt; that its source for new &lt;code&gt;IndexReader&lt;/code&gt;s is the provided &lt;code&gt;IndexWriter&lt;/code&gt; instance (instead of a &lt;code&gt;Directory&lt;/code&gt; instance).  
After that, use the &lt;code&gt;SearcherManager&lt;/code&gt; &lt;a href="http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html"&gt;just as before&lt;/a&gt;.&lt;br&gt;&lt;br&gt;Typically you'll set the &lt;code&gt;applyAllDeletes&lt;/code&gt; boolean to &lt;code&gt;true&lt;/code&gt;, meaning each reopened reader is required to apply all previous deletion operations (&lt;code&gt;deleteDocuments&lt;/code&gt; or &lt;code&gt;updateDocument/s&lt;/code&gt;) up until that point.&lt;br&gt;&lt;br&gt;Sometimes your usage won't require deletions to be applied.  For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions.  If that's the case, you can pass &lt;code&gt;applyAllDeletes=false&lt;/code&gt; instead.  This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly.  However, if you're using Lucene's trunk (to be eventually released as 4.0), another option is to use &lt;code&gt;MemoryCodec&lt;/code&gt; on your &lt;code&gt;id&lt;/code&gt; field to &lt;a href="http://blog.mikemccandless.com/2011/06/primary-key-lookups-are-28x-faster-with.html"&gt;greatly reduce the primary-key lookup time&lt;/a&gt;.&lt;br&gt;&lt;br&gt;Note that some or even all of the previous deletes may still be applied even if you pass &lt;code&gt;false&lt;/code&gt;.  
Also, the pending deletes are never &lt;em&gt;lost&lt;/em&gt; if you pass &lt;code&gt;false&lt;/code&gt;: they remain buffered and will still eventually be applied.&lt;br&gt;&lt;br&gt;If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two &lt;code&gt;SearcherManager&lt;/code&gt;s, one applying deletes and one not.&lt;br&gt;&lt;br&gt;If you pass a non-null &lt;code&gt;ExecutorService&lt;/code&gt;, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request.  Most applications do not require this, because the concurrency across multiple searches is sufficient.  It's also not clear that this is effective in general as it adds per-segment overhead, and the available concurrency is a function of your index structure.  Perversely, a fully optimized index will have no concurrency!  Most applications should pass &lt;code&gt;null&lt;/code&gt;.&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;code&gt;&lt;font size=+2&gt;&lt;b&gt;NRTManager&lt;/b&gt;&lt;/font&gt;&lt;/code&gt;&lt;br&gt;&lt;br&gt;What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches?  Use &lt;code&gt;NRTManager&lt;/code&gt;!&lt;br&gt;&lt;br&gt;&lt;code&gt;NRTManager&lt;/code&gt; holds onto the &lt;code&gt;IndexWriter&lt;/code&gt; instance you provide and then exposes the same APIs for making index changes (&lt;code&gt;addDocument/s&lt;/code&gt;, &lt;code&gt;updateDocument/s&lt;/code&gt;, &lt;code&gt;deleteDocuments&lt;/code&gt;).  These methods forward to the underlying &lt;code&gt;IndexWriter&lt;/code&gt;, but then return a &lt;em&gt;generation&lt;/em&gt; token (a Java &lt;code&gt;long&lt;/code&gt;) which you can hold onto after making any given change.  
The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.&lt;br&gt;&lt;br&gt;Then, when a given search request requires certain changes to be visible, pass that generation back to &lt;code&gt;NRTManager&lt;/code&gt; to obtain a searcher that's guaranteed to reflect all changes for that generation.&lt;br&gt;&lt;br&gt;Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum.  Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using &lt;code&gt;NRTManager.addDocument&lt;/code&gt;, saving the returned generation.  If she adds multiple posts, just keep the last generation.&lt;br&gt;&lt;br&gt;Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made.  Of course, if your reopen time is fast enough (say once per second), unless Alice types &lt;em&gt;very&lt;/em&gt; quickly, any search she runs will already reflect her posts.&lt;br&gt;&lt;br&gt;But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call &lt;code&gt;NRTManager.waitForGeneration&lt;/code&gt; to obtain the &lt;code&gt;SearcherManager&lt;/code&gt; to use for searching.  If the latest searcher already covers the requested generation, the method returns immediately.  Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.&lt;br&gt;&lt;br&gt;If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it's far less important that Alice's changes become immediately visible to Bob.  There's (usually!) 
no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.&lt;br&gt;&lt;br&gt;Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index.  That immediate search must first wait for the returned generation to become available.&lt;br&gt;&lt;br&gt;The power of &lt;code&gt;NRTManager&lt;/code&gt; is you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model.  &lt;code&gt;NRTManager&lt;/code&gt; hides all the tricky details of tracking generations.&lt;br&gt;&lt;br&gt;But: don't abuse this!  You may be tempted to always wait for the last generation you indexed for all searches, but this would result in very low search throughput on concurrent hardware since all searches would bunch up, waiting for reopens.  With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.&lt;br&gt;&lt;br&gt;Managing reopens is a little trickier with &lt;code&gt;NRTManager&lt;/code&gt;, since you should reopen at higher frequency whenever a search is waiting for a specific generation.  To address this, there's the useful &lt;code&gt;NRTManagerReopenThread&lt;/code&gt; class; use it like this:&lt;pre&gt;&lt;br /&gt;  double minStaleSec = 0.025;&lt;br /&gt;  double maxStaleSec = 5.0;&lt;br /&gt;  NRTManagerReopenThread thread = new NRTManagerReopenThread(&lt;br /&gt;                                       nrtManager,&lt;br /&gt;                                       maxStaleSec,&lt;br /&gt;                                       minStaleSec);&lt;br /&gt;  thread.start();&lt;br /&gt;  ...&lt;br /&gt;  thread.close();&lt;br /&gt;&lt;/pre&gt;The &lt;code&gt;minStaleSec&lt;/code&gt; sets an upper bound on how frequently reopens should occur.  
This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec.&lt;br&gt;&lt;br&gt;The &lt;code&gt;maxStaleSec&lt;/code&gt; sets a lower bound on how frequently reopens should occur.  This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches.  Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time.  Be sure to eventually call &lt;code&gt;thread.close()&lt;/code&gt;, when you are done reopening (for example, on shutting down the application).&lt;br&gt;&lt;br&gt;You are also free to use your own strategy for calling &lt;code&gt;maybeReopen&lt;/code&gt;; you don't have to use &lt;code&gt;NRTManagerReopenThread&lt;/code&gt;.  Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5694774165455660309?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5694774165455660309/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5694774165455660309'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5694774165455660309'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html' 
title='Near-real-time readers with Lucene&apos;s SearcherManager and NRTManager'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7288601351161582082</id><published>2011-10-25T11:55:00.001-04:00</published><updated>2011-10-25T11:55:43.003-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Compact Language Detector'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><title type='text'>Accuracy and performance of Google's Compact Language Detector</title><content type='html'>To get a sense of the accuracy and performance of Google's &lt;a href="http://code.google.com/p/chromium-compact-language-detector"&gt;Compact Language Detector&lt;/a&gt;, I ran some tests against two other packages:  &lt;br/&gt;  &lt;br/&gt;&lt;ul&gt;  &lt;li&gt; &lt;a href="http://tika.apache.org"&gt;Apache Tika&lt;/a&gt;, implemented  in Java, using its &lt;a href="http://tika.apache.org/0.10/api/org/apache/tika/language/LanguageIdentifier.html"&gt;LanguageIdentifier&lt;/a&gt; class  &lt;li&gt; &lt;a href="http://code.google.com/p/language-detection"&gt;&lt;code&gt;language-detection&lt;/code&gt;&lt;/a&gt;,  a project on &lt;a href="http://code.google.com"&gt;Google Code&lt;/a&gt;, also implemented in Java  &lt;br/&gt;&lt;/ul&gt;  &lt;br/&gt;For the test corpus I used &lt;a href="http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/"&gt;the corpus described here&lt;/a&gt;, created by the author of &lt;code&gt;language-detection&lt;/code&gt;.  
It contains 1000 texts from each of 21 languages, randomly sampled from the &lt;a href="http://www.statmt.org/europarl"&gt;Europarl corpus&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;It's not a perfect test (no test ever is!): the content is already very clean plain text; there are no domain, language, or encoding hints to apply (which you'd normally have with HTML content loaded over HTTP); it "only" covers 21 languages (versus at least 76 that CLD can detect).&lt;br/&gt;&lt;br/&gt;CLD and &lt;code&gt;language-detection&lt;/code&gt; cover all 21 languages, but Tika is missing Bulgarian (&lt;code&gt;bg&lt;/code&gt;), Czech (&lt;code&gt;cs&lt;/code&gt;), Lithuanian (&lt;code&gt;lt&lt;/code&gt;) and Latvian (&lt;code&gt;lv&lt;/code&gt;), so I only tested on the remaining subset of 17 languages that all three detectors support.  This works out to 17,000 texts totalling 2.8 MB.&lt;br/&gt;&lt;br/&gt;Many of the texts are very short, making the test challenging: the shortest is 25 bytes, and 290 (1.7%) of the 17,000 are 30 bytes or less.&lt;br/&gt;&lt;br/&gt;In addition to the challenges of the corpus, the differences in the detectors make the comparison somewhat apples to oranges.  For example, CLD detects at least 76 languages, while &lt;code&gt;language-detection&lt;/code&gt; detects 53 and Tika detects 27, so this biases against CLD, and &lt;code&gt;language-detection&lt;/code&gt; to a lesser extent, since their classification task is harder relative to Tika's.&lt;br/&gt;&lt;br/&gt;For CLD, I disabled its &lt;a href="http://blog.mikemccandless.com/2011/10/additions-to-compact-language-detector.html"&gt;option to abstain&lt;/a&gt; (&lt;code&gt;removeWeakMatches&lt;/code&gt;), so that it always guesses at the language even when confidence is low, to match the other two detectors.  
I also turned off the &lt;code&gt;pickSummaryLanguage&lt;/code&gt; option, as it was also hurting accuracy; now CLD simply picks the highest-scoring match as the detected language.&lt;br/&gt;&lt;br/&gt;For &lt;code&gt;language-detection&lt;/code&gt;, I ran with the default &lt;code&gt;ALPHA&lt;/code&gt; of 0.5, and set the random seed to 0.&lt;br/&gt;&lt;br/&gt;Here are the raw results:&lt;br/&gt;&lt;br/&gt;CLD results (total 98.82% = 16800 / 17000):&lt;br/&gt;&lt;font face=Courier&gt;&lt;table border=0&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;da&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;93.4%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;da=934&lt;/td&gt;&lt;td&gt;&amp;nbsp;nb=54&lt;/td&gt;&lt;td&gt;&amp;nbsp;sv=5&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;eu=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;is=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;hr=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;de&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.6%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=996&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;ga=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;cy=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;el&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;100.0%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;el=1000&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;en&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;100.0%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1000&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;es&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;98.3%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;es=983&lt;/td&gt;&lt;td&gt;&amp;nbsp;pt=4&lt;/td&gt;&lt;td&gt;&amp;nbsp;gl=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;eu=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;id=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;da=1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;et&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.6%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=996&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;id=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;fi&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;100.0%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=1000&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;fr&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.2%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=992&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=4&lt;/td&gt;&lt;td&gt;&amp;nbsp;sq=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;ca=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;hu&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.9%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;hu=999&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;it&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.5%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=995&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;mt=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;id=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;eu=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;nl&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;99.5%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;nl=995&lt;/td&gt;&lt;td&gt;&amp;nbsp;af=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;sv=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;pl&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.6%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;pl=996&lt;/td&gt;&lt;td&gt;&amp;nbsp;tr=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;sw=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;nb=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;pt&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;98.7%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;pt=987&lt;/td&gt;&lt;td&gt;&amp;nbsp;gl=4&lt;/td&gt;&lt;td&gt;&amp;nbsp;es=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;mt=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;is=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;ht=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;ro&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.8%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=998&lt;/td&gt;&lt;td&gt;&amp;nbsp;da=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;ca=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;sk&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;98.8%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;sk=988&lt;/td&gt;&lt;td&gt;&amp;nbsp;cs=9&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;sl&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;95.1%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=951&lt;/td&gt;&lt;td&gt;&amp;nbsp;hr=32&lt;/td&gt;&lt;td&gt;&amp;nbsp;sr=8&lt;/td&gt;&lt;td&gt;&amp;nbsp;sk=5&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;id=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;cs=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;sv&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.0%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;sv=990&lt;/td&gt;&lt;td&gt;&amp;nbsp;nb=9&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;Tika results (total 97.12% = 16510 / 17000):&lt;br/&gt;&lt;font face=Courier&gt;&lt;table border=0&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;da&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;87.6%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;da=876&lt;/td&gt;&lt;td&gt;&amp;nbsp;no=112&lt;/td&gt;&lt;td&gt;&amp;nbsp;nl=4&lt;/td&gt;&lt;td&gt;&amp;nbsp;sv=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;de&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;98.5%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=985&lt;/td&gt;&lt;td&gt;&amp;nbsp;nl=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;da=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;sv=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;ca=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;el&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;100.0%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;el=1000&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;en&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;96.9%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=969&lt;/td&gt;&lt;td&gt;&amp;nbsp;no=10&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=6&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=4&lt;/td&gt;&lt;td&gt;&amp;nbsp;sk=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;hu=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;sv=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;es&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;89.8%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;es=898&lt;/td&gt;&lt;td&gt;&amp;nbsp;gl=47&lt;/td&gt;&lt;td&gt;&amp;nbsp;pt=22&lt;/td&gt;&lt;td&gt;&amp;nbsp;ca=15&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=6&lt;/td&gt;&lt;td&gt;&amp;nbsp;eo=4&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;sk=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;nl=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;et&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.1%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=991&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=4&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;no=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;ca=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;fi&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;99.4%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=994&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=5&lt;/td&gt;&lt;td&gt;&amp;nbsp;hu=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;fr&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;98.0%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=980&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=6&lt;/td&gt;&lt;td&gt;&amp;nbsp;eo=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;sk=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;no=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;gl=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;es=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;ca=1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;hu&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.9%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;hu=999&lt;/td&gt;&lt;td&gt;&amp;nbsp;ca=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;it&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;99.4%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=994&lt;/td&gt;&lt;td&gt;&amp;nbsp;eo=4&lt;/td&gt;&lt;td&gt;&amp;nbsp;pt=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;nl&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;97.8%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;nl=978&lt;/td&gt;&lt;td&gt;&amp;nbsp;no=8&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;da=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;pl=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;gl=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;pl&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.1%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;pl=991&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;sk=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;hu=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;pt&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;94.4%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;pt=944&lt;/td&gt;&lt;td&gt;&amp;nbsp;gl=48&lt;/td&gt;&lt;td&gt;&amp;nbsp;hu=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;ca=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;es=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;ro&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.3%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=993&lt;/td&gt;&lt;td&gt;&amp;nbsp;is=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;pl=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;hu=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;sk&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;96.2%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;sk=962&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=21&lt;/td&gt;&lt;td&gt;&amp;nbsp;pl=13&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;sl&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;98.5%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=985&lt;/td&gt;&lt;td&gt;&amp;nbsp;sk=7&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=4&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;pt=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;no=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;sv&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;97.1%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;sv=971&lt;/td&gt;&lt;td&gt;&amp;nbsp;no=15&lt;/td&gt;&lt;td&gt;&amp;nbsp;nl=6&lt;/td&gt;&lt;td&gt;&amp;nbsp;da=6&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;ca=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;&lt;code&gt;Language-detection&lt;/code&gt; results (total 99.22% = 16868 / 17000):&lt;br/&gt;&lt;font face=Courier&gt;&lt;table border=0&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;da&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;97.1%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;da=971&lt;/td&gt;&lt;td&gt;&amp;nbsp;no=28&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;de&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;99.8%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=998&lt;/td&gt;&lt;td&gt;&amp;nbsp;da=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;af=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;el&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;100.0%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;el=1000&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;en&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.7%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=997&lt;/td&gt;&lt;td&gt;&amp;nbsp;nl=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;af=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;es&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.5%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;es=995&lt;/td&gt;&lt;td&gt;&amp;nbsp;pt=4&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;et&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.6%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=996&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;af=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;fi&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;99.8%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;fi=998&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;fr&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.8%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=998&lt;/td&gt;&lt;td&gt;&amp;nbsp;sv=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;hu&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.9%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;hu=999&lt;/td&gt;&lt;td&gt;&amp;nbsp;id=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;it&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.8%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=998&lt;/td&gt;&lt;td&gt;&amp;nbsp;es=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;nl&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;97.7%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;nl=977&lt;/td&gt;&lt;td&gt;&amp;nbsp;af=21&lt;/td&gt;&lt;td&gt;&amp;nbsp;sv=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;de=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;pl&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;99.9%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;pl=999&lt;/td&gt;&lt;td&gt;&amp;nbsp;nl=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;pt&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.4%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;pt=994&lt;/td&gt;&lt;td&gt;&amp;nbsp;es=3&lt;/td&gt;&lt;td&gt;&amp;nbsp;it=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;hu=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;ro&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;99.9%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=999&lt;/td&gt;&lt;td&gt;&amp;nbsp;fr=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;sk&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;98.7%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;sk=987&lt;/td&gt;&lt;td&gt;&amp;nbsp;cs=8&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;ro=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;lt=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;et=1&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;sl&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td align=right&gt;97.2%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;sl=972&lt;/td&gt;&lt;td&gt;&amp;nbsp;hr=27&lt;/td&gt;&lt;td&gt;&amp;nbsp;en=1&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;b&gt;sv&lt;/b&gt;&amp;nbsp;&lt;/td&gt;&lt;td 
align=right&gt;99.0%&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;sv=990&lt;/td&gt;&lt;td&gt;&amp;nbsp;no=8&lt;/td&gt;&lt;td&gt;&amp;nbsp;da=2&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;Some quick analysis:&lt;ul&gt;&lt;li&gt; The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the tests are! &lt;br/&gt;&lt;br/&gt;&lt;li&gt; The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and &lt;code&gt;language-detection&lt;/code&gt;).  Tika in particular has trouble with Spanish (confuses it with Galician).  These confusions are to be expected: the languages are very similar.  &lt;/ul&gt;&lt;br/&gt;When &lt;code&gt;language-detection&lt;/code&gt; was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time.  These numbers are quite low!  This tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it's not the case that they are all always wrong on the short texts.&lt;br/&gt;&lt;br/&gt;This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts).  This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.&lt;br/&gt;&lt;br/&gt;You could also make a simple majority-rules voting system across these (and other) libraries.  I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with &lt;code&gt;language-detection&lt;/code&gt;'s choice.  
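&lt;br/&gt;&lt;br/&gt;A minimal Python sketch of that voting rule, assuming each detector's answer has already been normalized to the same set of language codes (the &lt;code&gt;vote&lt;/code&gt; function and its argument names are hypothetical, not taken from the test sources):&lt;pre&gt;

```python
from collections import Counter


def vote(cld_lang, tika_lang, langdetect_lang):
    """Majority-rules vote across the three detectors (hypothetical helper)."""
    counts = Counter([cld_lang, tika_lang, langdetect_lang])
    winner, votes = counts.most_common(1)[0]
    # With three voters the top count is 1, 2 or 3. A count of 1 means
    # all three disagree, so fall back to language-detection's answer.
    if votes == 1:
        return langdetect_lang
    return winner


if __name__ == "__main__":
    print(vote("da", "no", "da"))  # two votes for da
    print(vote("da", "no", "sv"))  # no majority, falls back to sv
```

&lt;/pre&gt;The fallback only fires when all three detectors disagree; any language with two or three votes wins outright.&lt;br/&gt;&lt;br/&gt;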
This voting scheme gives the best accuracy of all: total 99.59% (= 16930 / 17000)!&lt;br/&gt;&lt;br/&gt;Finally, I also separately tested the run time for each package.  Each time is the best of 10 runs through the full corpus:&lt;br/&gt;&lt;br/&gt;&lt;font face=Courier&gt;&lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;b&gt;CLD&lt;/b&gt;&lt;/td&gt;&lt;td align=right&gt;&amp;nbsp;171 msec&lt;/td&gt;&lt;td align=right&gt;&amp;nbsp;16.331 MB/sec&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;b&gt;&lt;code&gt;language-detection&lt;/code&gt;&lt;/b&gt;&lt;/td&gt;&lt;td align=right&gt;&amp;nbsp;2367 msec&lt;/td&gt;&lt;td align=right&gt;&amp;nbsp;1.180 MB/sec&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;b&gt;Tika&lt;/b&gt;&lt;/td&gt;&lt;td align=right&gt;&amp;nbsp;42219 msec&lt;/td&gt;&lt;td align=right&gt;&amp;nbsp;0.066 MB/sec&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/font&gt;&lt;br/&gt;&lt;br/&gt;CLD is incredibly fast!  &lt;code&gt;language-detection&lt;/code&gt; is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).&lt;br/&gt;&lt;br/&gt;I used &lt;a href="http://code.google.com/p/language-detection/downloads/detail?name=langdetect-09-13-2011.zip&amp;can=2&amp;q="&gt;the 09-13-2011 release&lt;/a&gt; of &lt;code&gt;language-detection&lt;/code&gt;, the current trunk (svn revision 1187915) of &lt;a href="https://svn.apache.org/repos/asf/tika/trunk/"&gt;Apache Tika&lt;/a&gt;, and the current trunk (hg revision b0adee43f3b1) of &lt;a href="http://code.google.com/p/chromium-compact-language-detector/source/browse/"&gt;CLD&lt;/a&gt;. All sources for the performance tests are &lt;a href="http://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/#hg%2Flangdetect"&gt;available from here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7288601351161582082?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://blog.mikemccandless.com/feeds/7288601351161582082/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html#comment-form' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7288601351161582082'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7288601351161582082'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html' title='Accuracy and performance of Google&apos;s Compact Language Detector'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-1091224898755687310</id><published>2011-10-24T14:10:00.000-04:00</published><updated>2011-10-24T14:11:12.770-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Compact Language Detector'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><title type='text'>Additions to Compact Language Detector API</title><content type='html'>&lt;br /&gt;I've made some small improvements after &lt;a href="http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html"&gt;my quick initial port&lt;/a&gt; of &lt;a href="http://code.google.com/p/chromium-compact-language-detector"&gt;Google's Compact Language Detection Library&lt;/a&gt;, starting with some helpful Python constants:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt; &lt;code&gt;cld.ENCODINGS&lt;/code&gt; has all the encoding names 
recognized by CLD; if you pass the encoding hint it must be one of these.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; &lt;code&gt;cld.LANGUAGES&lt;/code&gt; has the list of all base languages known (but not necessarily detectable) by CLD.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; &lt;code&gt;cld.EXTERNAL_LANGUAGES&lt;/code&gt; has the list of external languages known (but not necessarily detectable) by CLD.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; &lt;code&gt;cld.DETECTED_LANGUAGES&lt;/code&gt; has the list of detectable languages.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;I haven't found a reliable way to get the full list of detectable languages; &amp;nbsp;for now, I've started with all languages that are covered by the unit test, total count 75, which should be a lower bound on the true count.&lt;br /&gt;&lt;br /&gt;I also exposed control over whether CLD should abstain from a given matched language if the confidence is too low, by adding a parameter &lt;code&gt;removeWeakMatches&lt;/code&gt; (required in C and optional in Python, default &lt;code&gt;False&lt;/code&gt;). &amp;nbsp;Turn this option on if abstaining is OK in your use case, such as a browser toolbar offering to translate content. &amp;nbsp;Turn it off when testing accuracy vs other language detection libraries (unless they also abstain!).&lt;br /&gt;&lt;br /&gt;Finally, CLD has an algorithm that tries to pick the best "summary" language, and it doesn't always just pick the highest scoring match. 
For example, the code has this comment:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;// If English and X, where X (not UNK) is big enough,&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;// assume the English is boilerplate and return X.&lt;br /&gt;&lt;/pre&gt;See the &lt;a href="http://code.google.com/p/chromium-compact-language-detector/source/browse/encodings/compact_lang_det/compact_lang_det_impl.cc#1984"&gt;CalcSummaryLanguage function&lt;/a&gt; for more details!&lt;br /&gt;&lt;br /&gt;I found this was hurting accuracy in testing so I added a parameter &lt;code&gt;pickSummaryLanguage&lt;/code&gt; (default &lt;code&gt;False&lt;/code&gt;) to also turn this on or off.&lt;br /&gt;&lt;br /&gt;Finally, I fixed the Python binding to release the &lt;a href="http://wiki.python.org/moin/GlobalInterpreterLock"&gt;GIL&lt;/a&gt; while CLD is running, so multiple threads can now detect without falsely blocking one another.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-1091224898755687310?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/1091224898755687310/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/10/additions-to-compact-language-detector.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/1091224898755687310'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/1091224898755687310'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/10/additions-to-compact-language-detector.html' title='Additions to Compact Language Detector API'/><author><name>Mike 
McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5671023059566257432</id><published>2011-10-21T15:47:00.000-04:00</published><updated>2011-10-24T14:11:37.524-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Compact Language Detector'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><title type='text'>Language detection with Google's Compact Language Detector</title><content type='html'>&lt;br /&gt;Google's &lt;a href="http://www.google.com/chrome"&gt;Chrome browser&lt;/a&gt; has a useful translate feature, where it detects the language of the page you've visited and if it differs from your local language, it offers to translate it.&lt;br /&gt;&lt;br /&gt;Wonderfully, Google has open-sourced &lt;a href="http://code.google.com/chromium"&gt;most of Chrome's source code&lt;/a&gt;, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. 
&amp;nbsp; It looks like CLD was extracted from the language detection library used in &lt;a href="http://toolbar.google.com/"&gt;Google's toolbar&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;It turns out the &lt;a href="http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/"&gt;CLD part of the Chromium source tree&lt;/a&gt; is nicely standalone, so I pulled it out into a &lt;a href="http://code.google.com/p/chromium-compact-language-detector/"&gt;new separate Google code project&lt;/a&gt;, making it possible to use CLD directly from any C++ code.&lt;br /&gt;&lt;br /&gt;I also added a basic initial Python binding (one method!), and ported the small C++ unit test (verifying detection of known strings for 64 different languages) to Python (it passes!).&lt;br /&gt;&lt;br /&gt;So detecting language is now very simple from Python:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;import cld&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;topLanguageName = cld.detect(bytes)[0]&lt;br /&gt;&lt;/pre&gt;The detect method returns a tuple, including the language name and code (such as &lt;code&gt;RUSSIAN&lt;/code&gt;, &lt;code&gt;ru&lt;/code&gt;), an &lt;code&gt;isReliable&lt;/code&gt; boolean (&lt;code&gt;True&lt;/code&gt; if CLD is quite sure of itself), the number of actual text bytes processed, and then details for each of the top languages (up to 3) that were identified.&lt;br /&gt;&lt;br /&gt;You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out beforehand.&lt;br /&gt;&lt;br /&gt;You can also optionally provide hints to the detect method, including the declared encoding and language (for example, from an HTTP header or an embedded &lt;code&gt;META http-equiv&lt;/code&gt; tag in the HTML), as well as the domain name suffix (so the top-level domain suffix &lt;code&gt;es&lt;/code&gt; would boost the chances for detecting Spanish).  CLD uses these hints to boost the priors for certain languages.  
There is this fun comment in the code in front of the tables holding the per-language prior boosts:&lt;pre&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Generated by dsites 2008.07.07 from 10% of Base&lt;br /&gt;&lt;/pre&gt;How I wish I too could build tables off of 10% of Base!&lt;br /&gt;&lt;br /&gt;The code itself looks very cool and I suspect (but haven't formally verified!) it's quite accurate. &amp;nbsp;I only understand bits and pieces about how it works; you can read some details &lt;a href="http://www.globalbydesign.com/blog/2010/12/06/inside-googles-language-detection-tool"&gt;here&lt;/a&gt; and &lt;a href="http://www.archive.org/~aaron/iipc/language-detection-investigation.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;It's also not clear just how many languages it can detect; I see there are 161 "base" languages plus 44 "extended" languages, but then I see many test cases (102 out of 166!) commented out. &amp;nbsp;This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/"&gt;This port&lt;/a&gt; is all still very new, and I extracted CLD quickly, so likely there are some problems still to work out, but the fact that it passes the Python unit test is encouraging. 
&amp;nbsp;The &lt;a href="http://code.google.com/p/chromium-compact-language-detector/source/browse/README.txt"&gt;README.txt&lt;/a&gt; has some more details.&lt;br /&gt;&lt;br /&gt;Thank you Google!&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5671023059566257432?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5671023059566257432/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html#comment-form' title='29 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5671023059566257432'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5671023059566257432'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html' title='Language detection with Google&apos;s Compact Language Detector'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>29</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-3168948629378571995</id><published>2011-09-26T11:29:00.000-04:00</published><updated>2011-09-28T06:54:15.006-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Lucene's SearcherManager simplifies reopen with threads</title><content type='html'>Modern computers have wonderful 
hardware concurrency, within and across CPU cores, RAM and IO resources, which means your typical server-based search application should use multiple threads to fully utilize all resources.&lt;p&gt;For searching, this usually means you'll have one thread handle each search request, sharing a single &lt;code&gt;IndexSearcher&lt;/code&gt; instance. This model is effective: the Lucene developers work hard to minimize internal locking in all Lucene classes.  In fact, we recently removed thread contention during indexing (specifically, flushing), resulting in &lt;a href="http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html"&gt;massive gains in indexing throughput&lt;/a&gt; on highly concurrent hardware.&lt;p&gt;Since &lt;code&gt;IndexSearcher&lt;/code&gt; exposes a fixed, point-in-time view of the index, when you make changes to the index you'll need to reopen it.  Fortunately, since version 2.9, Lucene has provided the &lt;code&gt;IndexReader.reopen&lt;/code&gt; method to get a new reader reflecting the changes.&lt;p&gt;This operation is efficient: the new reader shares already warmed sub-readers in common with the old reader, so it only opens sub-readers for any newly created segments.  This means reopen time is generally in proportion to how many changes you made; however, when a large merge has completed it will be longer.  It's best to warm the new reader before putting it into production by running a set of "typical" searches for your application, so that Lucene performs one-time initialization for internal data structures (norms, field cache, etc.).&lt;p&gt;But how should you properly reopen, while search threads are still running and new searches are forever arriving?  Your search application is popular, users are always searching and there's never a good time to switch!  
The core issue is that you must never close your old &lt;code&gt;IndexReader&lt;/code&gt; while other threads are still using it for searching, otherwise those threads can easily hit cryptic exceptions that often mimic index corruption.&lt;p&gt;Lucene tries to detect that you've done this, and will throw a nice &lt;code&gt;AlreadyClosedException&lt;/code&gt;, but we cannot guarantee that exception is thrown since we only check up front, when the search kicks off: if you close the reader when a search is already underway then all bets are off.&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;img border="0" height="201" width="150" style="clear:right; float:right; margin-left:1em; margin-bottom:1em" src="http://4.bp.blogspot.com/-xnjkxEz-WYA/ToBW8Ot98DI/AAAAAAAAAJU/x8QopWum8tA/s1600/restroom.jpg" /&gt;&lt;/div&gt;&lt;p&gt;One simple approach would be to temporarily block all new searches and wait for all running searches to complete, and then close the old reader and switch to the new one. This is how janitors often clean a bathroom: they wait for all current users to finish and block new users with the all-too-familiar plastic yellow sign.&lt;p&gt;While the bathroom cleaning approach will work, it has an obviously serious drawback: during the cutover you are now forcing your users to wait, and that wait time could be long (the time for the slowest currently running search to finish).&lt;p&gt;A much better solution is to immediately direct new searches to the new reader, as soon as it's done warming, and then separately wait for the still-running searches against the old reader to complete. Once the very last search has finished with the old reader, close it.&lt;p&gt;This solution is fully concurrent: it has no locking whatsoever so searches are never blocked, as long as you use a separate thread to perform the reopen and warming.  
The time to reopen and warm the new reader has no impact on ongoing searches, except to the extent that reopen consumes CPU, RAM and IO resources to do its job (and, sometimes, this can in fact interfere with ongoing searches).&lt;p&gt;So how exactly do you implement this approach?  The simplest way is to use the reference counting APIs already provided by &lt;code&gt;IndexReader&lt;/code&gt; to track how many threads are currently using each searcher.  Fortunately, as of Lucene 3.5.0, there will be a new &lt;code&gt;contrib/misc&lt;/code&gt; utility class, &lt;code&gt;SearcherManager&lt;/code&gt;, originally created as an example for &lt;a href="http://www.manning.com/hatcher3/"&gt;Lucene in Action, 2nd edition&lt;/a&gt;, that does this for you!  (&lt;a href="https://issues.apache.org/jira/browse/LUCENE-3445"&gt;LUCENE-3445&lt;/a&gt; has the details.)&lt;p&gt;The class is easy to use. You first create it, by providing the &lt;code&gt;Directory&lt;/code&gt; holding your index and a &lt;code&gt;SearchWarmer&lt;/code&gt; instance:&lt;pre&gt;&lt;br /&gt;  class MySearchWarmer implements SearchWarmer {&lt;br /&gt;    @Override&lt;br /&gt;    public void warm(IndexSearcher searcher) throws IOException {&lt;br /&gt;      // Run some diverse searches, searching and sorting against all&lt;br /&gt;      // fields that are used by your application&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  Directory dir = FSDirectory.open(new File("/path/to/index"));&lt;br /&gt;  SearcherManager mgr = new SearcherManager(dir,&lt;br /&gt;                                            new MySearchWarmer());&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;Then, for each search request:&lt;pre&gt;&lt;br /&gt;  IndexSearcher searcher = mgr.acquire();&lt;br /&gt;  try {&lt;br /&gt;    // Do your search, including loading any documents, etc.&lt;br /&gt;  } finally {&lt;br /&gt;    mgr.release(searcher);&lt;br /&gt;&lt;br /&gt;    // Set to null to ensure we never again try to use&lt;br /&gt;    // this 
searcher instance after releasing:&lt;br /&gt;    searcher = null;&lt;br /&gt;  }&lt;br /&gt;&lt;/pre&gt;&lt;p&gt;Be sure you fully consume &lt;code&gt;searcher&lt;/code&gt; before releasing it!  A common mistake is to release it yet later accidentally use it again to load stored documents, for rendering the search results for the current page.&lt;p&gt;Finally, you'll need to periodically call the &lt;code&gt;maybeReopen&lt;/code&gt; method from a separate (i.e., non-searching) thread.  This method will reopen the reader, and only if there was actually a change will it cut over.  If your application knows when changes have been committed to the index, you can reopen right after that.  Otherwise, you can simply call &lt;code&gt;maybeReopen&lt;/code&gt; every X seconds.  When there has been no change to the index, the cost of &lt;code&gt;maybeReopen&lt;/code&gt; is negligible, so calling it frequently is fine.&lt;p&gt;Beware the potentially high transient cost of reopen and warm!  During reopen, as you must have two readers open until the old one can be closed, you should budget plenty of RAM in the computer and heap for the JVM, to comfortably handle the worst case when the two readers share no sub-readers (for example, after a full optimize) and thus consume 2X the RAM of a single reader.  Otherwise you might hit a swap storm or &lt;code&gt;OutOfMemoryError&lt;/code&gt;, effectively taking down your entire search application.  Worse, you won't see this problem early on: your first few hundred reopens could easily use only small amounts of added heap, but then suddenly on some unexpected reopen the cost is far higher.  Reopening and warming is also generally IO intensive as the reader must load certain index data structures into memory.&lt;p&gt;Next time I'll describe another utility class, &lt;code&gt;NRTManager&lt;/code&gt;, available since version 3.3.0, that you should use instead if your application uses Lucene's fast-turnaround near-real-time (NRT) search.  
This class solves the  same problem (thread-safety during reopening) as &lt;code&gt;SearcherManager&lt;/code&gt; but adds a fun twist as it gives you more specific control over which changes must be visible in the newly opened reader.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-3168948629378571995?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/3168948629378571995/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3168948629378571995'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3168948629378571995'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html' title='Lucene&apos;s &lt;code&gt;SearcherManager&lt;/code&gt; simplifies reopen with threads'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-xnjkxEz-WYA/ToBW8Ot98DI/AAAAAAAAAJU/x8QopWum8tA/s72-c/restroom.jpg' height='72' width='72'/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-520338935312671001</id><published>2011-06-30T06:09:00.004-04:00</published><updated>2011-06-30T08:38:17.755-04:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Primary key lookups are 2.8X faster with MemoryCodec</title><content type='html'>A few days ago I committed the new &lt;a href="https://issues.apache.org/jira/browse/LUCENE-3209"&gt;MemoryCodec&lt;/a&gt; to Lucene's trunk (to be 4.0).  This codec indexes all terms and postings into a compact &lt;a href="http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html"&gt;finite-state transducer&lt;/a&gt; (FST) and then, at search time, avoids I/O by performing all terms and postings enumerations in memory using the FST.&lt;br /&gt;&lt;br /&gt;If your application needs fast primary-key lookups, and you can afford the required additional memory, this codec might be a good match for the &lt;tt&gt;id&lt;/tt&gt; field.  To test this, I switched Lucene's nightly benchmark to use &lt;tt&gt;MemoryCodec&lt;/tt&gt; (just for its &lt;tt&gt;id&lt;/tt&gt; field), and performance jumped from around 179 K to 509 K lookups per second:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://people.apache.org/~mikemccand/lucenebench/PKLookup.html"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 302px;" src="http://2.bp.blogspot.com/-RvDf4osoNEk/TgxMaLuDtlI/AAAAAAAAAIM/TIIvCHOUmL0/s1600/PKLookupMemCodec.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5623954047385187922" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This is an awesome improvement!  It's particularly impressive as the &lt;tt&gt;id&lt;/tt&gt; field was previously indexed using &lt;tt&gt;PulsingCodec&lt;/tt&gt;, which was &lt;a href="http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html"&gt;already faster than the default &lt;tt&gt;StandardCodec&lt;/tt&gt;&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This is the performance for a single thread, and should scale up linearly if you use multiple threads. 
Each lookup resolves 4,000 keys in order at once from the &lt;tt&gt;id&lt;/tt&gt; field, performing the lookups segment by segment for best performance (&lt;a href="http://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/perf/SearchPerfTest.java#353"&gt;see the source code&lt;/a&gt;). The index has 27.6 M docs across multiple segments.&lt;br /&gt;&lt;br /&gt;Of course, there is added memory required, specifically 188 MB for this index, which works out to 7.1 bytes per document on average.&lt;br /&gt;&lt;br /&gt;There are two sources of &lt;tt&gt;MemoryCodec&lt;/tt&gt;'s gains.  First, the obvious one: since everything is in memory, you never wait for an I/O seek operation, as long as you ensure the sneaky OS &lt;a href="http://blog.mikemccandless.com/2011/04/just-say-no-to-swapping.html"&gt;never swaps out your process memory&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Second, I separately added a new &lt;a href="https://issues.apache.org/jira/browse/LUCENE-3225"&gt;&lt;tt&gt;seekExact&lt;/tt&gt; API&lt;/a&gt; to &lt;tt&gt;TermsEnum&lt;/tt&gt;, enabling codecs to save CPU if the caller does not need to know the following term when the target term doesn't exist, as is the case here.  &lt;tt&gt;MemoryCodec&lt;/tt&gt; has an optimized implementation for &lt;tt&gt;seekExact&lt;/tt&gt; (and so does the cool &lt;a href="http://blog.mikemccandless.com/2010/10/lucenes-simpletext-codec.html"&gt;&lt;tt&gt;SimpleTextCodec&lt;/tt&gt;&lt;/a&gt;!).  Eventually other codecs should as well, by using the &lt;a href="https://issues.apache.org/jira/browse/LUCENE-3030"&gt;block tree terms index&lt;/a&gt;, but we're not there yet.&lt;br /&gt;&lt;br /&gt;The &lt;tt&gt;id&lt;/tt&gt; field in the nightly benchmark omits term freq and positions, however &lt;tt&gt;MemoryCodec&lt;/tt&gt; is fully general: you can use it for any field (not just primary-key), storing positions, payloads, etc. 
Also, its values are zero-padded sequential integers (00000001, 00000002, 00000003, etc.), which is likely important for performance as it allows maximal sharing in the FST.  I haven't tested but I suspect had I used something more random, such as &lt;a href="http://en.wikipedia.org/wiki/Globally_unique_identifier"&gt;GUIDs&lt;/a&gt;, memory usage would be higher and lookup performance worse as each segment's FST would be less dense (share less).&lt;br /&gt;&lt;br /&gt;Of course, Lucene is not a database, and you normally use it for its fast search performance, not primary-key lookups.  The one common search use case where you do require primary-key lookups is during indexing, when deleting or updating documents by an &lt;tt&gt;id&lt;/tt&gt; field.  Near-realtime search with updates or deletions &lt;a href="http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html"&gt;relies on this&lt;/a&gt;, since the deleted documents must be resolved during reopen, so we also see a healthy speedup in the NRT reopen time:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://people.apache.org/~mikemccand/lucenebench/nrt.html"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 308px;" src="http://1.bp.blogspot.com/-x4CBP_7F-O4/TgxMhavx5EI/AAAAAAAAAIU/_YbOYHIti78/s1600/NRTMemCodec.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5623954171678024770" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The NRT latency dropped from around 52 milliseconds to 43 milliseconds, a 17% improvement.  
This is "only" 17% because opening a new reader must also do other things like flush the indexed documents as a new segment.&lt;br /&gt;&lt;br /&gt;Perhaps more importantly, the variance also dropped substantially, which is expected because with &lt;tt&gt;MemoryCodec&lt;/tt&gt; and &lt;tt&gt;NRTCachingDirectory&lt;/tt&gt;, NRT reopen is fully I/O free (performs no reads or writes when opening a new reader).&lt;br /&gt;&lt;br /&gt;One limitation of &lt;tt&gt;MemoryCodec&lt;/tt&gt; is it's an all-or-nothing deal: all terms and postings are in memory, or they aren't.  &lt;a href="https://issues.apache.org/jira/browse/LUCENE-3069"&gt;LUCENE-3069&lt;/a&gt;, still to be done (any volunteers?), aims to fix this, by enabling you to separately choose whether terms and/or postings data should be in memory.&lt;br /&gt;&lt;br /&gt;I suspect an even more specialized codec, for example one that requires the field values to be compact integers, and also requires that the values are unique (only supports primary-key fields), could do even better than &lt;tt&gt;MemoryCodec&lt;/tt&gt; by storing the mapping in global (across all segments) parallel arrays.  Such a codec would no longer be general; it'd only work for primary-key fields whose values are compact integers.  But it'd have  faster lookups than &lt;tt&gt;MemoryCodec&lt;/tt&gt; and should use less memory per document.  This codec could simply wrap any other codec, i.e. 
it would create the arrays on reader initialization, and delegate persisting the postings into the index to the wrapped codec.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-520338935312671001?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/520338935312671001/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/06/primary-key-lookups-are-28x-faster-with.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/520338935312671001'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/520338935312671001'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/06/primary-key-lookups-are-28x-faster-with.html' title='Primary key lookups are 2.8X faster with MemoryCodec'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-RvDf4osoNEk/TgxMaLuDtlI/AAAAAAAAAIM/TIIvCHOUmL0/s72-c/PKLookupMemCodec.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5306626649764544993</id><published>2011-06-14T13:27:00.004-04:00</published><updated>2011-06-14T13:46:35.369-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Near-real-time latency during large merges</title><content 
type='html'>I looked into the curious issue I described in my &lt;a href="http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html"&gt;last post&lt;/a&gt;, where the NRT reopen delays can become "spikey" (take longer) during a large merge.&lt;br /&gt;&lt;br /&gt;To show the issue, I modified the NRT test to kick off a background optimize on startup. This runs a single large merge, creating a 13 GB segment, and indeed produces spikey reopen delays (purple):&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://people.apache.org/~mikemccand/NRTNoRateLimit.html"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 363px;" src="http://1.bp.blogspot.com/-A3JN33lTesg/Tfeaw5owp2I/AAAAAAAAAH8/iqBTypzBjrQ/s1600/NRTNoRateLimit.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5618129225064163170" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The large merge finishes shortly after 7 minutes, after which the reopen delays become healthy again.  Search performance (green) is unaffected.&lt;br /&gt;&lt;br /&gt;I also added Linux's dirty bytes to the graph, as reported by &lt;tt&gt;/proc/meminfo&lt;/tt&gt;; it's the saw-tooth blue/green series on the bottom.  Note that it's divided by 10, to better fit the Y axis; the peaks are around 800-900 MB.&lt;br /&gt;&lt;br /&gt;The large merge writes bytes at a fairly high rate (around 30 MB/sec), but Linux buffers those writes in RAM, only actually flushing them to disk every 30 seconds; this is what produces the saw-tooth pattern.&lt;br /&gt;&lt;br /&gt;From the graph you can see that the spikey reopen delays generally correlate to when Linux is flushing the dirty pages to disk. Apparently, this heavy write IO interferes with the read IO required when resolving deleted terms to document IDs.  
To confirm this, I ran the same stress test, but with only adds (no deletions); the reopen delays were then unaffected by the ongoing large merge.&lt;br /&gt;&lt;br /&gt;So finally the mystery is explained, but, how to fix it?&lt;br /&gt;&lt;br /&gt;I know I could &lt;a href="http://www.westnet.com/~gsmith/content/linux-pdflush.htm"&gt;tune Linux's IO&lt;/a&gt;, for example to write more frequently, but I'd rather find a Lucene-only solution since we can't expect most users to tune the OS.&lt;br /&gt;&lt;br /&gt;One possibility is to make a RAM resident terms dictionary, just for primary-key fields.  This could be very compact, for example by using an FST, and should give lookups that never hit disk unless the OS has &lt;a href="http://blog.mikemccandless.com/2011/04/just-say-no-to-swapping.html"&gt;frustratingly swapped out your RAM data structures&lt;/a&gt;.  This can also be separately useful for applications that need fast document lookup by primary key, so someone should at some point build this.&lt;br /&gt;&lt;br /&gt;Another, lower level idea is to simply rate limit byte/sec written by merges.  Since big merges also impact ongoing searches, likely we could help that case as well.  
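&lt;br /&gt;&lt;br /&gt;A minimal sketch of what such rate limiting could look like, assuming a hypothetical &lt;tt&gt;MergeRateLimiterSketch&lt;/tt&gt; class that a merge's output would call before writing each chunk of bytes. The class and its method names are illustrative assumptions only; they are not part of Lucene's API:&lt;br /&gt;&lt;pre&gt;

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not Lucene's API): throttles writes to a target MB/sec. */
final class MergeRateLimiterSketch {
    private final double bytesPerNano;  // allowed write rate, in bytes per nanosecond
    private long nextAllowedNanos;      // earliest time at which the next write may start

    MergeRateLimiterSketch(double mbPerSec) {
        this.bytesPerNano = mbPerSec * 1024 * 1024 / 1e9;
        this.nextAllowedNanos = System.nanoTime();
    }

    /** Nanoseconds that writing the given number of bytes should take at the target rate. */
    long nanosFor(long bytes) {
        return (long) (bytes / bytesPerNano);
    }

    /** Call before writing a chunk; sleeps if the writer is ahead of the target rate. Returns nanos waited. */
    synchronized long pause(long bytes) {
        long now = System.nanoTime();
        long start = Math.max(now, nextAllowedNanos);
        // Reserve the time slot this chunk consumes at the target rate.
        nextAllowedNanos = start + nanosFor(bytes);
        long waitNanos = start - now;  // never negative, because start is the max of the two
        if (waitNanos != 0) {
            try {
                TimeUnit.NANOSECONDS.sleep(waitNanos);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve interrupt status for the caller
            }
        }
        return waitNanos;
    }
}
```

&lt;/pre&gt;Calling &lt;tt&gt;pause(n)&lt;/tt&gt; before writing each chunk of &lt;tt&gt;n&lt;/tt&gt; bytes keeps the long-run write rate at or below the target; with a 10 MB/sec limit, for example, a 10 MB chunk accounts for roughly one second.&lt;br /&gt;&lt;br /&gt;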
To try this out, I made a simple prototype (see &lt;a href="https://issues.apache.org/jira/browse/LUCENE-3203"&gt;LUCENE-3203&lt;/a&gt;), and then re-ran the same stress test, limiting all merging to 10 MB/sec:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://people.apache.org/~mikemccand/NRTRateLimit.html"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 364px;" src="http://3.bp.blogspot.com/-L5r585dC6hY/Tfea4bXMwvI/AAAAAAAAAIE/nNqNU-TomXU/s1600/NRTRateLimit.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5618129354376397554" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The optimize now took 3 times longer, and the peak dirty bytes (around 300 MB) is 1/3rd as large, as expected since the IO write rate is limited to 10 MB/sec.  But look at the reopen delays: they are now much better contained, averaging around 70 milliseconds while the optimize is running, and dropping to 60 milliseconds once the optimize finishes.  I think the ability to limit merging IO is an important feature for Lucene!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5306626649764544993?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5306626649764544993/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/06/near-real-time-latency-during-large.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5306626649764544993'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5306626649764544993'/><link rel='alternate' type='text/html' 
href='http://blog.mikemccandless.com/2011/06/near-real-time-latency-during-large.html' title='Near-real-time latency during large merges'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-A3JN33lTesg/Tfeaw5owp2I/AAAAAAAAAH8/iqBTypzBjrQ/s72-c/NRTNoRateLimit.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-6323037927906542466</id><published>2011-06-07T18:19:00.009-04:00</published><updated>2011-06-09T10:03:46.674-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Lucene's near-real-time search is fast!</title><content type='html'>Lucene's near-real-time (NRT) search feature, available since 2.9, enables an application to make index changes visible to a new searcher with fast turnaround time.   In some cases, such as modern social/news sites (e.g., &lt;a href="http://linkedin.com/"&gt;LinkedIn&lt;/a&gt;, &lt;a href="http://twitter.com/"&gt;Twitter&lt;/a&gt;, &lt;a href="http://facebook.com/"&gt;Facebook&lt;/a&gt;, &lt;a href="http://stackoverflow.com/"&gt;Stack Overflow&lt;/a&gt;, &lt;a href="http://news.ycombinator.com/"&gt;Hacker News&lt;/a&gt;, &lt;a href="http://dzone.com/"&gt;DZone&lt;/a&gt;, etc.), fast turnaround time is a hard requirement.&lt;br /&gt;&lt;br /&gt;Fortunately, it's trivial to use.  
Just open your initial NRT reader, like this:&lt;br /&gt;&lt;tt&gt;&lt;br /&gt;    // w is your IndexWriter&lt;br /&gt;    IndexReader r = IndexReader.open(w, true);&lt;br /&gt;&lt;/tt&gt;&lt;br /&gt;(That's the 3.1+ API; prior to that use &lt;tt&gt;w.getReader()&lt;/tt&gt; instead).&lt;br /&gt;&lt;br /&gt;The returned reader behaves just like one opened with &lt;tt&gt;IndexReader.open&lt;/tt&gt;: it exposes the point-in-time snapshot of the index as of when it was opened.  Wrap it in an &lt;tt&gt;IndexSearcher&lt;/tt&gt; and search away!&lt;br /&gt;&lt;br /&gt;Once you've made changes to the index, call &lt;tt&gt;r.reopen()&lt;/tt&gt; and you'll get another NRT reader; just be sure to close the old one.&lt;br /&gt;&lt;br /&gt;What's special about the NRT reader is that it searches uncommitted changes from &lt;tt&gt;IndexWriter&lt;/tt&gt;, enabling your application to decouple fast turnaround time from index durability on crash (i.e., how often &lt;tt&gt;commit&lt;/tt&gt; is called), something not previously possible.&lt;br /&gt;&lt;br /&gt;Under the hood, when an NRT reader is opened, Lucene flushes indexed documents as a new segment, applies any buffered deletions to in-memory bit-sets, and then opens a new reader showing the changes.  
The reopen time is in proportion to how many changes you made since last reopening that reader.&lt;br /&gt;&lt;br /&gt;Lucene's approach is a nice compromise between &lt;a href="http://en.wikipedia.org/wiki/Immediate_consistency"&gt;immediate consistency&lt;/a&gt;, where changes are visible after each index change, and &lt;a href="http://en.wikipedia.org/wiki/Eventual_consistency"&gt;eventual consistency&lt;/a&gt;, where changes are visible "later" but you don't usually know exactly when.&lt;br /&gt;&lt;br /&gt;With NRT, your application has &lt;em&gt;controlled consistency&lt;/em&gt;: you decide exactly when changes must become visible.&lt;br /&gt;&lt;br /&gt;Recently there have been some good improvements related to NRT:&lt;ul&gt;&lt;li&gt; New default merge policy, &lt;a href="http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html"&gt;&lt;tt&gt;TieredMergePolicy&lt;/tt&gt;&lt;/a&gt;, which is able to select more efficient non-contiguous merges, and favors segments with more deletions.&lt;br /&gt;&lt;br /&gt; &lt;li&gt; &lt;tt&gt;NRTCachingDirectory&lt;/tt&gt; takes load off the IO system by caching small segments in RAM (&lt;a href="https://issues.apache.org/jira/browse/LUCENE-3092"&gt;LUCENE-3092&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt; &lt;li&gt; When you open an NRT reader you can now optionally specify that deletions do not need to be applied, making reopen faster for those cases that can tolerate temporarily seeing deleted documents returned, or have some other means of filtering them out (&lt;a href="https://issues.apache.org/jira/browse/LUCENE-2900"&gt;LUCENE-2900&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt; &lt;li&gt; Segments that are 100% deleted are now dropped instead of inefficiently merged (&lt;a href="https://issues.apache.org/jira/browse/LUCENE-2010"&gt;LUCENE-2010&lt;/a&gt;).&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;How fast is NRT search?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I created a &lt;a 
href="http://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/perf/NRTPerfTest.java"&gt;simple performance test&lt;/a&gt; to answer this.  I first built a starting index by indexing all of Wikipedia's content (25 GB plain text), broken into 1 KB sized documents.&lt;br /&gt;&lt;br /&gt;Using this index, the test then reindexes all the documents again, this time at a fixed rate of 1 MB/second plain text.  This is a very fast rate compared to the typical NRT application; for example, it's almost twice as fast as &lt;a href="http://blog.twitter.com/2011/02/superbowl.html"&gt;Twitter's recent peak during this year's superbowl&lt;/a&gt; (4,064 tweets/second), assuming every tweet is 140 bytes, and assuming Twitter indexed all tweets on a single shard.&lt;br /&gt;&lt;br /&gt;The test uses &lt;tt&gt;updateDocument&lt;/tt&gt;, replacing documents by randomly selected ID, so that Lucene is forced to apply deletes across all segments.  In addition, 8 search threads run a fixed &lt;tt&gt;TermQuery&lt;/tt&gt; at the same time.&lt;br /&gt;&lt;br /&gt;Finally, the NRT reader is reopened once per second.&lt;br /&gt;&lt;br /&gt;I ran the test on modern hardware, a 24 core machine (dual x5680 Xeon CPUs) with an &lt;a href="http://www.ocztechnology.com/ocz-vertex-3-sata-iii-2-5-ssd.html"&gt;OCZ Vertex 3 240 GB SSD&lt;/a&gt;, using Oracle's 64 bit &lt;tt&gt;Java 1.6.0_21&lt;/tt&gt; and Linux Fedora 13.  
I gave &lt;tt&gt;Java&lt;/tt&gt; a 2 GB max heap, and used &lt;tt&gt;MMapDirectory&lt;/tt&gt;.&lt;br /&gt;&lt;br /&gt;The test ran for 6 hours 25 minutes, since that's how long it takes to re-index all of Wikipedia at a limited rate of 1 MB/sec; here's the resulting QPS and NRT reopen delay (milliseconds) over that time:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://people.apache.org/~mikemccand/NRTMMap.html" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img boder="0" style="cursor:pointer; cursor:hand;width: 600px; height: 364px;" src="http://3.bp.blogspot.com/-eIz5ug2ef14/Te-ynIR96cI/AAAAAAAAAH0/SvvtFxKp1hA/s1600/NRTMMap.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5615632934752653266" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The search QPS is green and the time to reopen each reader (NRT reopen delay in milliseconds) is blue; the graph is an interactive &lt;a href="http://dygraphs.com/"&gt;Dygraph&lt;/a&gt;, so if you click through above, you can then zoom in to any interesting region by clicking and dragging. You can also apply smoothing by entering the size of the window into the text box in the bottom left part of the graph.&lt;br /&gt;&lt;br /&gt;Search QPS dropped substantially with time.  While annoying, this is expected, because of how deletions work in Lucene: documents are merely marked as deleted and thus are still visited but then filtered out, during searching.  They are only truly deleted when the segments are merged.  &lt;tt&gt;TermQuery&lt;/tt&gt; is a worst-case query; harder queries, such as &lt;tt&gt;BooleanQuery&lt;/tt&gt;, should see less slowdown from deleted, but not reclaimed, documents.&lt;br /&gt;&lt;br /&gt;Since the starting index had no deletions, and then picked up deletions over time, the QPS dropped.  
It looks like &lt;tt&gt;TieredMergePolicy&lt;/tt&gt; should perhaps be even more aggressive in targeting segments with deletions; however, finally around 5:40 a very large merge (reclaiming many deletions) was kicked off.  Once it finished the QPS recovered somewhat.&lt;br /&gt;&lt;br /&gt;Note that a real NRT application with deletions would see a more stable QPS since the index in "steady state" would always have some number of deletions in it; starting from a fresh index with no deletions is not typical.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Reopen delay during merging&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The reopen delay is mostly around 55-60 milliseconds (mean is 57.0), which is very fast (i.e., only 5.7% "duty cycle" of the every 1.0 second reopen rate).  There are random single spikes, which are caused by Java running a full GC cycle.  However, large merges can slow down the reopen delay (once around 1:14, again at 3:34, and then the very large merge starting at 5:40).  Many small merges (up to a few 100s of MB) were done but don't seem to impact reopen delay.  Large merges have been a challenge in Lucene for some time, also causing trouble for ongoing searching.&lt;br /&gt;&lt;br /&gt;I'm not yet sure why large merges so adversely impact reopen time; there are several possibilities.  It could be simple IO contention: a merge keeps the IO system very busy reading and writing many bytes, thus interfering with any IO required during reopen.  However, if that were the case, &lt;tt&gt;NRTCachingDirectory&lt;/tt&gt; (used by the test) should have prevented it, but didn't.  
It's also possible that the OS is [poorly] choosing to evict important process pages, such as the terms index, in favor of IO caching, causing the term lookups required when applying deletes to hit page faults; however, this also shouldn't be happening in my test since I've &lt;a href="http://blog.mikemccandless.com/2011/04/just-say-no-to-swapping.html"&gt;set Linux's swappiness to 0&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Yet another possibility is Linux's write cache becomes temporarily too full, thus stalling all IO in the process until it clears; in this case perhaps tuning some of Linux's &lt;a href="http://www.westnet.com/~gsmith/content/linux-pdflush.htm"&gt;pdflush tunables&lt;/a&gt; could help, although I'd much rather find a Lucene-only solution so this problem can be fixed without users having to tweak such advanced OS tunables, even swappiness.&lt;br /&gt;&lt;br /&gt;Fortunately, we have an active Google Summer of Code student, Varun Thacker, working on enabling &lt;tt&gt;Directory&lt;/tt&gt; implementations to pass appropriate flags to the OS when opening files for merging (&lt;a href="https://issues.apache.org/jira/browse/LUCENE-2793"&gt;LUCENE-2793&lt;/a&gt; and &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2795"&gt;LUCENE-2795&lt;/a&gt;).  From past testing I know that passing O_DIRECT &lt;a href="http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html"&gt;can prevent merges from evicting hot pages&lt;/a&gt;, so it's possible this will fix our slow reopen time as well since it bypasses the write cache.&lt;br /&gt;&lt;br /&gt;Finally, it's always possible other OSs do a better job managing the buffer cache, and wouldn't see such reopen delays during large merges.&lt;br /&gt;&lt;br /&gt;This issue is still a mystery, as there are many possibilities, but we'll eventually get to the bottom of it.  It could be we should simply add our own IO throttling, so we can control net MB/sec read and written by merging activity.  
This would make a nice addition to Lucene!&lt;br /&gt;&lt;br /&gt;Except for the slowdown during merging, the performance of NRT is impressive.  Most applications will have a required indexing rate far below 1 MB/sec per shard, and for most applications reopening once per second is fast enough.&lt;br /&gt;&lt;br /&gt;While there are exciting ideas to bring true real-time search to Lucene, by directly searching &lt;tt&gt;IndexWriter&lt;/tt&gt;'s RAM buffer as &lt;a href="http://vimeo.com/16063395"&gt;Michael Busch has implemented at Twitter&lt;/a&gt; with some cool &lt;a href="http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html"&gt;custom extensions to Lucene&lt;/a&gt;, I doubt even the most demanding social apps actually truly need better performance than we see today with NRT.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;tt&gt;NIOFSDirectory&lt;/tt&gt; vs &lt;tt&gt;MMapDirectory&lt;/tt&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Out of curiosity, I ran the exact same test as above, but this time with &lt;tt&gt;NIOFSDirectory&lt;/tt&gt; instead of &lt;tt&gt;MMapDirectory&lt;/tt&gt;:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://people.apache.org/~mikemccand/NRTNIOFS.html" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img border="0" style="cursor:pointer; cursor:hand;width: 600px; height: 364px;" src="http://3.bp.blogspot.com/-sKXoqkywx9U/Te9HMoW7HFI/AAAAAAAAAHs/rNjMNnZugyk/s1600/NRTNIOFS.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5615785542671866962" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There are some interesting differences.  The search QPS is substantially slower -- starting at 107 QPS vs 151, though part of this could easily be from getting different compilation out of hotspot.  
For some reason &lt;tt&gt;TermQuery&lt;/tt&gt;, in particular, has &lt;a href="http://people.apache.org/~mikemccand/lucenebench/Term.html"&gt;high variance&lt;/a&gt; from one JVM instance to another.&lt;br /&gt;&lt;br /&gt;The mean reopen time is slower: 67.7 milliseconds vs 57.0, and the reopen time seems more affected by the number of segments in the index (this is the saw-tooth pattern in the graph, matching when minor merges occur).  The takeaway message seems clear: on Linux, use &lt;tt&gt;MMapDirectory&lt;/tt&gt; not &lt;tt&gt;NIOFSDirectory&lt;/tt&gt;!&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Optimizing your NRT turnaround time&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;My test was just one datapoint, at a fixed fast reopen period (once per second) and at a high indexing rate (1 MB/sec plain text).  You should test specifically for your use-case what reopen rate works best.  Generally, the more frequently you reopen the faster the turnaround time will be, since fewer changes need to be applied; however, frequent reopening will reduce the maximum indexing rate.&lt;br /&gt;&lt;br /&gt;Most apps have relatively low required indexing rates compared to what Lucene can handle and can thus pick a reopen rate to suit the application's turnaround time requirements.&lt;br /&gt;&lt;br /&gt;There are also some simple steps you can take to reduce the turnaround time:&lt;ul&gt;&lt;li&gt; Store the index on a fast IO system, ideally a modern SSD.&lt;br /&gt;&lt;br /&gt; &lt;li&gt; Install a merged segment warmer (see &lt;tt&gt;IndexWriter.setMergedSegmentWarmer&lt;/tt&gt;).  This warmer is invoked by &lt;tt&gt;IndexWriter&lt;/tt&gt; to warm up a newly merged segment without blocking the reopen of a new NRT reader.  
If your application uses Lucene's &lt;tt&gt;FieldCache&lt;/tt&gt; or has its own caches, this is important as otherwise that warming cost will be spent on the first query to hit the new reader.&lt;br /&gt;&lt;br /&gt; &lt;li&gt; Use only as many indexing threads as needed to achieve your required indexing rate; often 1 thread suffices.  The fewer threads used for indexing, the faster the flushing, and the less merging (on trunk).&lt;br /&gt;&lt;br /&gt; &lt;li&gt; If you are using Lucene's trunk, and your changes include deleting or updating prior documents, then use the &lt;tt&gt;Pulsing&lt;/tt&gt; codec for your &lt;tt&gt;id&lt;/tt&gt; field since this &lt;a href="http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html"&gt;gives faster lookup performance&lt;/a&gt; which will make your reopen faster.&lt;br /&gt;&lt;br /&gt; &lt;li&gt; Use the new &lt;tt&gt;NRTCachingDirectory&lt;/tt&gt;, which buffers small segments in RAM to take load off the IO system (&lt;a href="https://issues.apache.org/jira/browse/LUCENE-3092"&gt;LUCENE-3092&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt; &lt;li&gt; Pass &lt;tt&gt;false&lt;/tt&gt; for &lt;tt&gt;applyDeletes&lt;/tt&gt; when opening an NRT reader, if your application can tolerate seeing deleted docs from the returned reader.&lt;br /&gt;&lt;br /&gt; &lt;li&gt; While it's not clear that thread priorities actually work correctly (see &lt;a href="http://www.youtube.com/watch?v=uL2D3qzHtqY"&gt;this Google Tech Talk&lt;/a&gt;), you should still set your thread priorities properly: the thread reopening your readers should be highest; next should be your indexing threads; and finally lowest should be all searching threads.  
If the machine becomes saturated, ideally only the search threads should take the hit.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;Happy near-real-time searching!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-6323037927906542466?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/6323037927906542466/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html#comment-form' title='17 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6323037927906542466'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6323037927906542466'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html' title='Lucene&apos;s near-real-time search is fast!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-eIz5ug2ef14/Te-ynIR96cI/AAAAAAAAAH0/SvvtFxKp1hA/s72-c/NRTMMap.png' height='72' width='72'/><thr:total>17</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-9047309606636504278</id><published>2011-05-21T08:49:00.002-04:00</published><updated>2011-05-21T09:00:49.799-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>The invisible Lucene bug fixed 
point</title><content type='html'>It turns out, &lt;a href="http://www.atlassian.com/software/jira/"&gt;the Jira issue tracking system&lt;/a&gt;, which we make heavy use of here at &lt;a href="http://apache.org"&gt;Apache&lt;/a&gt;, uses &lt;a href="http://lucene.apache.org"&gt;Lucene&lt;/a&gt; under the hood for searching and browsing issues. This is wonderful since it means Lucene developers are &lt;a href="http://en.wikipedia.org/wiki/Eating_your_own_dog_food"&gt;eating their own dog food&lt;/a&gt; whenever they use Jira.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.atlassian.com"&gt;Atlassian&lt;/a&gt; has opened up some doozy bugs over time, including one of the earliest bug numbers I've ever worked on, &lt;a href="https://issues.apache.org/jira/browse/LUCENE-140"&gt;LUCENE-140&lt;/a&gt;. They sent me a t-shirt for fixing that one (thank you!).&lt;br /&gt;&lt;br /&gt;Now, imagine this: what if there were a sneaky bug in Lucene, say a certain text fragment that causes an exception during indexing.  A user &lt;a href="https://issues.apache.org/jira/browse/LUCENE"&gt;opens an issue&lt;/a&gt; to report this, including the problematic text fragment, yet, because Jira uses Lucene, it hits an exception while indexing that fragment and causes this one bug to be un-searchable and un-viewable when browsing!  An invisible bug fixed point.&lt;br /&gt;&lt;br /&gt;It's somewhat mind bending to think about, Lucene recursing on itself through Jira, yet it's theoretically possible!  
Maybe we have a few invisible bug fixed points lurking already and nobody knows...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-9047309606636504278?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/9047309606636504278/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/05/invisible-lucene-bug-fixed-point.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/9047309606636504278'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/9047309606636504278'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/05/invisible-lucene-bug-fixed-point.html' title='The invisible Lucene bug fixed point'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7527378770560092483</id><published>2011-05-07T14:58:00.010-04:00</published><updated>2011-05-09T08:51:50.420-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>265% indexing speedup with Lucene's concurrent flushing</title><content type='html'>&lt;a href="http://blog.mikemccandless.com/2011/04/catching-slowdowns-in-lucene.html"&gt;A week ago&lt;/a&gt;, I described the &lt;a href="http://people.apache.org/~mikemccand/lucenebench/"&gt;nightly benchmarks&lt;/a&gt; we use to catch 
any unexpected slowdowns in Lucene's performance.  Back then the graphs were rather boring (a good thing), but not anymore!  Have a look at the &lt;a href="http://people.apache.org/~mikemccand/lucenebench/indexing.html"&gt;stunning jumps in Lucene's indexing rate&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://people.apache.org/~mikemccand/lucenebench/indexing.html"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 200px;" src="http://3.bp.blogspot.com/-Se4IoeNW-Cc/Tcet-DxtBfI/AAAAAAAAAHU/IK29w-AV1ag/s400/ConcurrentFlushing.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5604639542963144178" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;(Click through the image to see details about what changed on dates &lt;b&gt;A&lt;/b&gt;, &lt;b&gt;B&lt;/b&gt;, &lt;b&gt;C&lt;/b&gt; and &lt;b&gt;D&lt;/b&gt;).&lt;br /&gt;&lt;br /&gt;Previously we were around 102 GB of plain text per hour, and now it's about 270 GB/hour.  That's about 265% of the previous rate!  Lucene now indexes all of Wikipedia's &lt;a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download"&gt;23.2 GB (English) export&lt;/a&gt; in 5 minutes and 10 seconds.&lt;br /&gt;&lt;br /&gt;How did this happen?  &lt;a href="https://issues.apache.org/jira/browse/LUCENE-3023"&gt;Concurrent flushing&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;That new feature, having lived on a branch for quite some time, undergoing many fun iterations, was finally merged back to trunk about a week ago.&lt;br /&gt;&lt;br /&gt;Before concurrent flushing, whenever &lt;tt&gt;IndexWriter&lt;/tt&gt; needed to flush a new segment, it would stop all indexing threads and hijack one thread to perform the rather compute-intensive flush.  This was a nasty bottleneck on computers with highly concurrent hardware; flushing was inherently single-threaded.  I &lt;a href="http://blog.mikemccandless.com/2010/09/lucenes-indexing-is-fast.html"&gt;previously described the problem here&lt;/a&gt;. 
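&lt;br /&gt;&lt;br /&gt;As a rough sanity check on the figures quoted above, here is a minimal Python sketch.  It is not part of the nightly benchmark scripts, and the constant and function names are made up for illustration; it only converts the Wikipedia indexing time into GB/hour and compares the old and new rates:&lt;br /&gt;&lt;pre&gt;
```python
# Figures quoted in the post; the names are illustrative only.
OLD_RATE_GB_PER_HOUR = 102.0
NEW_RATE_GB_PER_HOUR = 270.0
WIKIPEDIA_GB = 23.2
INDEXING_SECONDS = 5 * 60 + 10  # 5 minutes and 10 seconds


def gb_per_hour(gigabytes, seconds):
    """Convert a (size, elapsed time) pair into a GB/hour rate."""
    return gigabytes * 3600.0 / seconds


def speedup(old_rate, new_rate):
    """Return how many times faster the new rate is than the old one."""
    return new_rate / old_rate


if __name__ == "__main__":
    measured = gb_per_hour(WIKIPEDIA_GB, INDEXING_SECONDS)
    ratio = speedup(OLD_RATE_GB_PER_HOUR, NEW_RATE_GB_PER_HOUR)
    print("Wikipedia run: %.1f GB/hour" % measured)
    print("New rate is %.2fx the old rate" % ratio)
```
&lt;/pre&gt;Under these numbers the Wikipedia run works out to roughly 269 GB/hour, in line with the ~270 GB/hour on the graph, and the new rate is about 2.65 times the old one.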
&lt;br /&gt;&lt;br /&gt;But with concurrent flushing, each thread freely flushes its own segment even while other threads continue indexing.  No more bottleneck!&lt;br /&gt;&lt;br /&gt;Note that there are two separate jumps in the graph.  The first jump, the day concurrent flushing landed (labelled as &lt;b&gt;B&lt;/b&gt; on the graph), shows the improvement while using only 6 threads and 512 MB RAM buffer during indexing.  Those settings resulted in the fastest indexing rate before concurrent flushing.&lt;br /&gt;&lt;br /&gt;The second jump (labelled as &lt;b&gt;D&lt;/b&gt; on the graph) happened when I increased the indexing threads to 20 and dropped the RAM buffer to 350 MB, giving the fastest indexing rate after concurrent flushing.&lt;br /&gt;&lt;br /&gt;One nice side effect of concurrent flushing is that you can now use RAM buffers well over 2.1 GB, as long as you use multiple threads.  Curiously, I found that larger RAM buffers slow down overall indexing rate.  This might be because of the discontinuity when closing &lt;tt&gt;IndexWriter&lt;/tt&gt;, when we must wait for all the RAM buffers to be written to disk.  It would be better to measure steady state indexing rate, while indexing an effectively infinite content source, and ignoring the startup and ending transients; I suspect if I measured that instead, we'd see gains from larger RAM buffers, but this is just speculation at this point.&lt;br /&gt;&lt;br /&gt;There were some &lt;b&gt;very&lt;/b&gt; challenging changes required to make concurrent flushing work, especially around how &lt;tt&gt;IndexWriter&lt;/tt&gt; handles buffered deletes.  Simon Willnauer does a great job describing these changes &lt;a href="http://blog.jteam.nl/2011/05/03/lucene-indexing-gains-concurrency/"&gt;here&lt;/a&gt; and &lt;a href="http://blog.jteam.nl/2011/04/01/gimme-all-resources-you-have-i-can-use-them/"&gt;here&lt;/a&gt;.  
Concurrency is tricky!&lt;br /&gt;&lt;br /&gt;Remember this change only helps you if you have concurrent hardware, you use enough threads for indexing and there's no other bottleneck (for example, in the content source that provides the documents).  Also, if your IO system can't keep up then it will bottleneck your CPU concurrency.  The nightly benchmark runs on a computer with 12 real (24 with hyperthreading) cores and a fast (OCZ Vertex 3) solid-state disk.  Finally, this feature is not yet released: it was committed to Lucene's trunk, which will eventually be released as 4.0.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7527378770560092483?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7527378770560092483/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7527378770560092483'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7527378770560092483'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html' title='265% indexing speedup with Lucene&apos;s concurrent flushing'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://3.bp.blogspot.com/-Se4IoeNW-Cc/Tcet-DxtBfI/AAAAAAAAAHU/IK29w-AV1ag/s72-c/ConcurrentFlushing.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-8418209192171448474</id><published>2011-04-29T14:46:00.005-04:00</published><updated>2011-05-01T12:29:20.575-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Catching slowdowns in Lucene</title><content type='html'>Lucene has &lt;a href="http://blog.mikemccandless.com/2011/03/your-test-cases-should-sometimes-fail.html"&gt;great randomized tests&lt;/a&gt; to catch functional failures, but when we accidentally commit a performance regression (we slow down indexing or searching), nothing catches us!&lt;br /&gt;&lt;br /&gt;This is scary, because we want things to get only faster with time.&lt;br /&gt;&lt;br /&gt;So, when there's a core change that we think may impact performance, we run before/after tests to verify.  But this is ad hoc and error-prone: we could easily forget to do this, or fail to anticipate that a code change might have a performance impact.&lt;br /&gt;&lt;br /&gt;Even when we do test performance of a change, the slowdown could be relatively small, easily hiding within the unfortunately often substantial noise of our tests.  Over time we might accumulate many such small, unmeasurable slowdowns, suffering the fate of the &lt;a href="http://en.wikipedia.org/wiki/Boiling_frog"&gt;boiling frog&lt;/a&gt;.  We do also run performance tests before releasing, but it's better to catch slowdowns sooner: fixing them just before releasing is... dangerous.&lt;br /&gt;&lt;br /&gt;To address this problem, I've created a &lt;a href="http://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/nightlyBench.py"&gt;script&lt;/a&gt; that runs standard benchmarks on Lucene's trunk (to be 4.0), nightly.  
It indexes all of &lt;a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download"&gt;Wikipedia's English XML export&lt;/a&gt;, three times (with different settings and document sizes), runs a near-real-time (NRT) turnaround time test for 30 minutes, and finally a diverse set of hard queries.&lt;br /&gt;&lt;br /&gt;This has been running for a few weeks now, and the results are &lt;a href="http://people.apache.org/~mikemccand/lucenebench"&gt;accessible to anyone&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;It's wonderful to see that &lt;a href="http://people.apache.org/~mikemccand/lucenebench/indexing.html"&gt;Lucene's indexing throughput&lt;/a&gt; is already a bit faster (~98 GB plain text per hour) than when &lt;a href="http://blog.mikemccandless.com/2010/09/lucenes-indexing-is-fast.html"&gt;I last measured&lt;/a&gt;!&lt;br /&gt;&lt;br /&gt;Near-real-time reopen latency &lt;a href="http://people.apache.org/~mikemccand/lucenebench/nrt.html"&gt;is here&lt;/a&gt;; the test measures how long it takes (on average, after discarding outliers) to open a new NRT reader.  It's quite intensive, indexing around 1 MB plain text per second as updates (delete+addDocument), and reopening once per second, on the full previously built Wikipedia index.&lt;br /&gt;&lt;br /&gt;To put this in perspective, that's almost twice &lt;a href="http://www.twitter.com"&gt;Twitter's&lt;/a&gt; recent peak indexing rate &lt;a href="http://blog.twitter.com/2011/02/superbowl.html"&gt;during the 2011 Superbowl&lt;/a&gt; (4,064 Tweets/second), although Twitter's use-case is harder because the documents are much smaller, and presumably there's additional indexed metadata beyond just the text of the Tweet.  
Twitter has actually implemented some cool changes to Lucene to enable real-time searching without reopening readers; Michael Busch describes them &lt;a href="http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html"&gt;here&lt;/a&gt; and &lt;a href="http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-With-Lucene-presented-by-Michael-Busch-of-Twitter"&gt;here&lt;/a&gt;.  Some day I hope these will be folded into Lucene!&lt;br /&gt;&lt;br /&gt;Finally, we test all sorts of queries: &lt;tt&gt;PhraseQuery&lt;/tt&gt; (&lt;a href="http://people.apache.org/~mikemccand/lucenebench/Phrase.html"&gt;exact&lt;/a&gt; and &lt;a href="http://people.apache.org/~mikemccand/lucenebench/SloppyPhrase.html"&gt;sloppy&lt;/a&gt;), &lt;tt&gt;FuzzyQuery&lt;/tt&gt; (edit distance &lt;a href="http://people.apache.org/~mikemccand/lucenebench/Fuzzy1.html"&gt;1&lt;/a&gt; and &lt;a href="http://people.apache.org/~mikemccand/lucenebench/Fuzzy2.html"&gt;2&lt;/a&gt;), &lt;a href="http://people.apache.org/~mikemccand/lucenebench/AndHighHigh.html"&gt;four&lt;/a&gt; &lt;a href="http://people.apache.org/~mikemccand/lucenebench/AndHighMed.html"&gt;variants&lt;/a&gt; &lt;a href="http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html"&gt;of&lt;/a&gt; &lt;a href="http://people.apache.org/~mikemccand/lucenebench/OrHighMed.html"&gt;&lt;tt&gt;BooleanQuery&lt;/tt&gt;&lt;/a&gt;, &lt;a href="http://people.apache.org/~mikemccand/lucenebench/IntNRQ.html"&gt;&lt;tt&gt;NumericRangeQuery&lt;/tt&gt;&lt;/a&gt;, &lt;a href="http://people.apache.org/~mikemccand/lucenebench/Prefix3.html"&gt;&lt;tt&gt;PrefixQuery&lt;/tt&gt;&lt;/a&gt;, &lt;a href="http://people.apache.org/~mikemccand/lucenebench/Wildcard.html"&gt;&lt;tt&gt;WildcardQuery&lt;/tt&gt;&lt;/a&gt;, &lt;a href="http://people.apache.org/~mikemccand/lucenebench/SpanNear.html"&gt;&lt;tt&gt;SpanNearQuery&lt;/tt&gt;&lt;/a&gt;, and of course &lt;a 
href="http://people.apache.org/~mikemccand/lucenebench/Term.html"&gt;&lt;tt&gt;TermQuery&lt;/tt&gt;&lt;/a&gt;. In addition we test the &lt;a href="http://people.apache.org/~mikemccand/lucenebench/Respell.html"&gt;automaton spell checker&lt;/a&gt;, and &lt;a href="http://people.apache.org/~mikemccand/lucenebench/PKLookup.html"&gt;primary-key lookup&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;A few days ago, I switched all tests to the very fast 240 GB &lt;a href="http://www.ocztechnology.com/ocz-vertex-3-sata-iii-2-5-ssd.html"&gt;OCZ Vertex 3&lt;/a&gt; (previously it was a traditional spinning-magnets hard drive).  It looks like indexing throughput gained a bit of performance (~102 GB plain text per hour), search performance was unaffected (expected, because for this test all postings easily fit in available RAM), and the noise in the NRT turnaround time dropped drastically, to near zero. NRT is very IO intensive, so it makes sense that a fast IO system improves its turnaround time; I need to dig further into this.&lt;br /&gt;&lt;br /&gt;Unfortunately, performance results are inherently noisy.  For example, you can see the large noise (the error band is +/- one standard deviation) in the &lt;a href="http://people.apache.org/~mikemccand/lucenebench/Term.html"&gt;&lt;tt&gt;TermQuery&lt;/tt&gt; results&lt;/a&gt;; other queries seem to have less noise for some reason.&lt;br /&gt;&lt;br /&gt;So far the graphs are rather boring: nice and flat.  
This is a good thing!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-8418209192171448474?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/8418209192171448474/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/04/catching-slowdowns-in-lucene.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8418209192171448474'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8418209192171448474'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/04/catching-slowdowns-in-lucene.html' title='Catching slowdowns in Lucene'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7470278319757935222</id><published>2011-04-24T09:05:00.003-04:00</published><updated>2011-04-24T11:21:26.706-04:00</updated><title type='text'>Just say no to swapping!</title><content type='html'>Imagine you love to cook; it's an intense hobby of yours.  Over time, you've accumulated many &lt;a href="http://en.wikipedia.org/wiki/Spice"&gt;fun spices&lt;/a&gt;, but your pantry is too small, so, you rent an off-site storage facility, and move the less frequently used spice racks there.  Problem solved!&lt;br /&gt;&lt;br /&gt;Suddenly you decide to cook this great new recipe.  
You head to the pantry to retrieve your &lt;a href="http://en.wikipedia.org/wiki/Saffron"&gt;Saffron&lt;/a&gt;, but it's not there!  It was moved out to the storage facility and must now be retrieved (this is a &lt;a href="http://en.wikipedia.org/wiki/Page_fault"&gt;hard page fault&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;No problem -- your neighbor volunteers to go fetch it for you. Unfortunately, the facility is ~2,900 miles away, all the way across the US, so it takes your neighbor about 4 days to retrieve it!&lt;br /&gt;&lt;br /&gt;This assumes you normally take 7 seconds to retrieve a spice from the pantry; that your data was in main memory (~100 nanoseconds access time), not in the &lt;a href="http://en.wikipedia.org/wiki/CPU_cache"&gt;CPU's caches&lt;/a&gt; (which'd be maybe 10 nanoseconds); that your swap file is on a fast (say, &lt;a href="http://en.wikipedia.org/wiki/Western_Digital_Raptor"&gt;WD Raptor&lt;/a&gt;) spinning-magnets hard drive with 5 millisecond average access time; and that your neighbor drives non-stop at 60 mph to the facility and back.  (5 milliseconds is 50,000 times 100 nanoseconds, and 50,000 times 7 seconds is roughly 4 days.)&lt;br /&gt;&lt;br /&gt;Even worse, your neighbor drives a motorcycle, and so he can only retrieve one spice rack at a time.  So, after waiting 4 days for the Saffron to come back, when you next go to the pantry to get some &lt;a href="http://en.wikipedia.org/wiki/Paprika"&gt;Paprika&lt;/a&gt;, it's also "swapped out" and you must wait another 4 days!  It's possible that first spice rack also happened to have the Paprika but it's also likely it did not; that depends on your &lt;a href="http://en.wikipedia.org/wiki/Locality_of_reference"&gt;spice locality&lt;/a&gt;.  
Also, with each trip, your neighbor must pick a spice rack to move out to the facility, so that the returned spice rack has a place to go (it is a "swap", after all), so the Paprika could have just been swapped out!&lt;br /&gt;&lt;br /&gt;Sadly, it might easily be many weeks until you succeed in cooking your dish.&lt;br /&gt;&lt;br /&gt;Maybe in the olden days, when memory itself was a &lt;a href="http://en.wikipedia.org/wiki/Magnetic-core_memory"&gt;core of little magnets&lt;/a&gt;, swapping cost wasn't so extreme, but these days, as memory access time has improved drastically while hard drive access time hasn't budged, the disparity is now unacceptable.  Swapping has become a &lt;a href="http://www.joelonsoftware.com/articles/LeakyAbstractions.html"&gt;badly leaking abstraction&lt;/a&gt;.  When a typical process (say, your e-mail reader) has to "swap back in" after not being used for a while, it can hit 100s of such page faults, before finishing redrawing its window.  It's an awful experience, though it has the fun side effect of letting you see, in slow motion, just what precise steps your email reader goes through when redrawing its window.&lt;br /&gt;&lt;br /&gt;Swapping is especially disastrous with JVM processes.  See, the JVM generally won't do a &lt;a href="http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)"&gt;full GC cycle&lt;/a&gt; until it has run out of its allowed heap, so most of your heap is likely occupied by not-yet-collected garbage.  Since these pages aren't being touched (because they are garbage and thus unreferenced), the OS happily swaps them out.  
When GC finally runs, you have a ridiculous swap storm, pulling in all these pages only to then discover that they are in fact filled with garbage and should be discarded; this can easily make your GC cycle take many minutes!&lt;br /&gt;&lt;br /&gt;It'd be better if the JVM could work more closely with the OS so that GC would somehow run on-demand whenever the OS wants to start swapping so that, at least, we never swap out garbage.  Until then, make sure you don't set your JVM's heap size too large!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Just use an &lt;a href="http://en.wikipedia.org/wiki/Solid-state_drive"&gt;SSD&lt;/a&gt;...&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;These days, many machines ship with &lt;a href="http://en.wikipedia.org/wiki/Solid-state_drive"&gt;solid state disks&lt;/a&gt;, which are an astounding (though still costly) improvement over spinning magnets; once you've used an SSD you can never go back; it's just one of life's many one-way doors.&lt;br /&gt;&lt;br /&gt;You might be tempted to declare that this problem is solved, since SSDs are so blazingly fast, right?  Indeed, they are orders of magnitude faster than spinning magnets, but they are still 2-3 orders of magnitude slower than main memory or CPU cache.  The typical SSD might have 50 microseconds access time, which equates to ~58 total miles of driving at 60 mph.  Certainly a huge improvement, but still unacceptable if you want to cook your dish on time!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Just add RAM...&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Another common workaround is to put lots of RAM in your machine, but this can easily backfire: operating systems will happily swap out memory pages in favor of caching IO pages, so if you have any processes accessing lots of bytes (say, mencoder encoding a 50 GB Blu-ray movie, maybe a virus checker or backup program, or even Lucene searching against a large index or doing a large merge), the OS will swap your pages out.  
This then means that the more RAM you have, the more swapping you get, and the problem only gets worse!&lt;br /&gt;&lt;br /&gt;Fortunately, some OS's let you control this behavior: on Linux, you can &lt;a href="http://kerneltrap.org/node/3000"&gt;tune swappiness down to 0&lt;/a&gt; (most Linux distros default this to a highish number); Windows also has a checkbox, under My Computer -&gt; Properties -&gt; Advanced -&gt; Performance Settings -&gt; Advanced -&gt; Memory Usage, that lets you favor Programs or System Cache, that's likely doing something similar.&lt;br /&gt;&lt;br /&gt;There are &lt;a href="http://linux.die.net/man/2/madvise"&gt;low-level&lt;/a&gt; &lt;a href="http://linux.die.net/man/2/fadvise"&gt;IO flags&lt;/a&gt; that these programs are supposed to use so that the OS knows not to cache the pages they access, but sometimes the processes fail to use them or cannot use them (for example, they are &lt;a href="http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html"&gt;not yet exposed to Java&lt;/a&gt;), and even if they do, sometimes the OS &lt;a href="http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html"&gt;ignores them&lt;/a&gt;!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;When swapping is OK&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If your computer never runs any interactive processes, ie, a process where a human is blocked (waiting) on the other end for something to happen, and only runs batch processes which tend to be active at different times, then swapping can be an overall win since it allows that process which is active to make nearly-full use of the available RAM.  Net/net, over time, this will give greater overall throughput for the batch processes on the machine.&lt;br /&gt;&lt;br /&gt;But, remember that the server running your web-site is an interactive process; if your server processes (web/app server, database, search server, etc.) 
are stuck swapping, your site has for all intents and purposes become unusable to your users.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;This is a fixable problem&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Most processes have known data structures that consume substantial RAM, and in many cases these processes could easily discard and later regenerate their data structures in much less time than even a single page fault.  Caches can simply be pruned or discarded since they will self-regenerate over time.&lt;br /&gt;&lt;br /&gt;These data structures should never be swapped out, since regeneration is far cheaper.  Somehow the OS should ask each RAM-intensive and least-recently-accessed process to discard its data structures to free up RAM, instead of swapping out the pages occupied by the data structure.  Of course, this would require a tighter interaction between the OS and processes than exists today; Java's &lt;a href="http://download.oracle.com/javase/1.4.2/docs/api/java/lang/ref/SoftReference.html"&gt;&lt;tt&gt;SoftReference&lt;/tt&gt;&lt;/a&gt; is close, except this only works within a single JVM, and does not interact with the OS.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;What can you do?&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Until this problem is solved for real, the simplest workaround is to disable swapping entirely, and stuff as much RAM as you can into the machine.  RAM is cheap, memory modules are dense, and modern motherboards accept many modules.  This is what I do.&lt;br /&gt;&lt;br /&gt;Of course, with this approach, when you run out of RAM stuff will start failing.  If the software is well written, it'll fail gracefully: your browser will tell you it cannot open a new window or visit a new page.  If it's poorly written it will simply crash, thus quickly freeing up RAM and hopefully not losing any data or corrupting any files in the process.  
Linux takes the &lt;a href="http://linux-mm.org/OOM_Killer"&gt;simple draconian approach of picking a memory hogging process and SIGKILL'ing it&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;If you don't want to disable swapping you should at least tell the OS not to swap pages out for IO caching.&lt;br /&gt;&lt;br /&gt;Just say no to swapping!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7470278319757935222?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7470278319757935222/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/04/just-say-no-to-swapping.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7470278319757935222'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7470278319757935222'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/04/just-say-no-to-swapping.html' title='Just say no to swapping!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-116762424330433452</id><published>2011-03-31T06:32:00.002-04:00</published><updated>2011-03-31T06:49:03.642-04:00</updated><title type='text'>A login-wall is nearly as bad as a pay-wall!</title><content type='html'>&lt;a 
href="http://www.businessinsider.com/exclusive-qa-quora-may-be-turning-down-billion-dollar-offers-but-its-still-losing-to-this-guy-2011-2"&gt;Much&lt;/a&gt; has been &lt;a href="http://www.techfounder.net/2011/02/01/my-take-on-quora-vs-stackoverflow-or-substance-vs-social/"&gt;said&lt;/a&gt; and &lt;a href="http://meta.stackoverflow.com/questions/44618/what-can-we-learn-from-quora"&gt;asked&lt;/a&gt; about &lt;a href="http://socialcompare.com/en/comparison/compare-question-answer-sites-quora-vs-yahoo-answers-vs-stackoverflow-vs-ted-conversations"&gt;the differences&lt;/a&gt; between &lt;a href="http://stackoverflow.com"&gt;Stack Overflow&lt;/a&gt; and &lt;a href="http://quora.com"&gt;Quora&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;And, while there are deep and interesting differences, such as how Stack Overflow makes &lt;a href="http://stackoverflow.com/faq"&gt;reputation tracking&lt;/a&gt; and &lt;a href="http://meta.stackoverflow.com/questions/17853/how-do-badges-work"&gt;badges&lt;/a&gt; explicit, in my opinion, one simple difference is the most important of all: Quora's login-wall.&lt;br /&gt;&lt;br /&gt;See, you cannot do anything with Quora until you've registered, while with Stack Overflow you can do almost everything without registering.  They are polar opposites!&lt;br /&gt;&lt;br /&gt;Like everyone else, I have too much curiosity and too little time.  I try to keep up on &lt;a href="http://news.ycombinator.com"&gt;Hacker News&lt;/a&gt; (sorry &lt;a href="http://digg.com"&gt;Digg&lt;/a&gt; and &lt;a href="http://reddit.com"&gt;Reddit&lt;/a&gt;): I click through to the cool stuff, and then move on.  You have one precious first page impression to rope me in, so don't spend that impression with a login-wall!&lt;br /&gt;&lt;br /&gt;I mean, sure, I'm still going to go link up my &lt;a href="http://facebook.com"&gt;Facebook&lt;/a&gt; account so I can login to Quora and see the questions, answers, conversations. 
(And, yes, Facebook seems to be winning at the "universal ID" game, even though I like &lt;a href="http://openid.net"&gt;OpenID&lt;/a&gt; better.)  Still, for each persistent user like me, you've lost 9 non-persistent ones with that dreaded login-wall.&lt;br /&gt;&lt;br /&gt;Remember: if you are a new cool Web site, gaining value from the &lt;a href="http://en.wikipedia.org/wiki/Network_effect"&gt;network effect&lt;/a&gt; (as all social sites do), trying to eke out just a tiny slice of all these fickle users jumping around out here, don't put up a login-wall!  It's just about &lt;a href="http://www.techdirt.com/articles/20090707/0207585464.shtml"&gt;as bad as a paywall&lt;/a&gt;.  Let brand new users do as much as possible with your site, and make that very first page impression count.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-116762424330433452?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/116762424330433452/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/03/login-wall-is-nearly-as-bad-as-pay-wall.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/116762424330433452'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/116762424330433452'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/03/login-wall-is-nearly-as-bad-as-pay-wall.html' title='A login-wall is nearly as bad as a pay-wall!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-9016538941732218950</id><published>2011-03-26T08:44:00.003-04:00</published><updated>2011-03-26T12:49:57.317-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Your test cases should sometimes fail!</title><content type='html'>I'm an avid subscriber of the delightful weekly (sometimes) Python-URL! email, highlighting the past week's interesting discussions across the numerous &lt;a href="http://mail.python.org/mailman/listinfo"&gt;Python lists&lt;/a&gt;. Each summary starts with the best quote from the week; here's &lt;a href="http://groups.google.com/group/comp.lang.python/browse_thread/thread/a5c7bd62047263e0/036e2be279cb78f9?lnk=raot&amp;fwc=2"&gt;last week's quote&lt;/a&gt;:&lt;blockquote&gt;"So far as I know, that actually just means that the test suite is insufficient." - Peter Seebach, when an application passes all its tests.&lt;/blockquote&gt;I wholeheartedly agree: if your build always passes its tests, that means your tests are not tough enough!  Ideally the tests should stay ahead of the software, constantly pulling you forwards to improve its quality.  If the tests keep passing, write new ones that fail!  
Or make existing ones evil-er.&lt;br /&gt;&lt;br /&gt;You'll be glad to know that Lucene/Solr's tests do sometimes fail, as you can see in the &lt;a href="http://jenkins-ci.org/content/hudsons-future"&gt;&lt;del&gt;Hudson&lt;/del&gt;&lt;/a&gt; Jenkins &lt;a href="https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk"&gt;automated trunk builds&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Randomized testing&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Our test infrastructure has gotten much better, just over the past 6 months or so, through heavy use of randomization.&lt;br /&gt;&lt;br /&gt;When a test needs a &lt;tt&gt;Directory&lt;/tt&gt; instance, but doesn't care which, it uses the &lt;tt&gt;newDirectory&lt;/tt&gt; method.  This method picks one of Lucene's &lt;tt&gt;Directory&lt;/tt&gt; implementations (&lt;tt&gt;RAMDirectory&lt;/tt&gt;, &lt;tt&gt;NIOFSDirectory&lt;/tt&gt;, &lt;tt&gt;MMapDirectory&lt;/tt&gt;, etc.) and then wraps it with &lt;tt&gt;MockDirectoryWrapper&lt;/tt&gt;, a nice little class that does all sorts of fun things like: occasionally calling &lt;tt&gt;Thread.yield&lt;/tt&gt;; preventing still-open files from being overwritten or deleted (acts-like-Windows); refusing to write to the same file twice (verifying Lucene is in fact write-once); breaking up a single &lt;tt&gt;writeBytes&lt;/tt&gt; into multiple calls; optionally throwing &lt;tt&gt;IOException&lt;/tt&gt; on disk full, or simply throwing exceptions at random times; simulating an OS/hardware crash by randomly corrupting un-&lt;tt&gt;sync&lt;/tt&gt;'d files in devilish ways; etc. We also pick a random timezone and locale.&lt;br /&gt;&lt;br /&gt;To randomize indexing, we create an &lt;tt&gt;IndexWriterConfig&lt;/tt&gt;, tweaking all sorts of settings, and use &lt;tt&gt;RandomIndexWriter&lt;/tt&gt; (like &lt;tt&gt;IndexWriter&lt;/tt&gt;, except it sometimes optimizes, commits, yields, etc.).  The &lt;tt&gt;newField&lt;/tt&gt; method enables or disables stored fields and term vectors.  
We create random codecs, per field, by combining a terms dictionary with a random terms index and postings implementations.  &lt;tt&gt;MockAnalyzer&lt;/tt&gt; injects payloads into its tokens.&lt;br /&gt;&lt;br /&gt;Sometimes we use the &lt;tt&gt;PreFlex&lt;/tt&gt; codec, to write all indices in the 3.x format (so that we test index backwards compatibility), and sometimes the nifty &lt;a href="http://blog.mikemccandless.com/2010/10/lucenes-simpletext-codec.html"&gt;SimpleText codec&lt;/a&gt;.  We have exotic methods for creating random yet somewhat realistic full Unicode strings.  When creating an &lt;tt&gt;IndexSearcher&lt;/tt&gt;, we might use threads (pass an &lt;tt&gt;ExecutorService&lt;/tt&gt;), or not.  We catch tests that leave threads running, or that cause &lt;a href="http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/util/FieldCacheSanityChecker.Insanity.html"&gt;insanity&lt;/a&gt; in the &lt;tt&gt;FieldCache&lt;/tt&gt; (for example by loading both parent and sub readers).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Reproducibility&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;To ensure a failure is reproducible, we save the random seeds and on a failure print out a nice line like this: &lt;blockquote&gt;&lt;tt&gt;&lt;nobr&gt;NOTE: reproduce with: ant test&lt;/nobr&gt; &lt;nobr&gt;-Dtestcase=TestFieldCacheTermsFilter&lt;/nobr&gt; &lt;nobr&gt;-Dtestmethod=testMissingTerms&lt;/nobr&gt; &lt;nobr&gt;-Dtests.seed=-1046382732738729184:5855929314778232889&lt;/nobr&gt;&lt;/tt&gt;&lt;/blockquote&gt;  This fixes the seed so that the test runs deterministically.  Sometimes, horribly, we have bugs in this seed logic, thus causing tests to &lt;b&gt;not&lt;/b&gt; run deterministically, and we scramble to fix those bugs first!&lt;br /&gt;&lt;br /&gt;If you happen to hit a test failure, please send that precious line to the dev list!  
This is like the &lt;a href="http://setiathome.berkeley.edu"&gt;Search for Extraterrestrial Intelligence (SETI)&lt;/a&gt;: there are some number of random seeds out there (hopefully, not too many!) that will lead to a failure, and if your computer is lucky enough to discover one of these golden seeds, please share the discovery!&lt;br /&gt;&lt;br /&gt;The merging of Lucene and Solr's development was also a big step forward for test coverage, since every change in Lucene is now tested against all of Solr's test cases as well.&lt;br /&gt;&lt;br /&gt;Tests accept a multiplier to crank things up, causing them to use more test documents or iterations, run for a longer time, etc.  We now have perpetual jobs on Jenkins, for both 3.x and trunk, launching every 15 minutes with multiplier 5.  We know quickly when someone breaks the build!&lt;br /&gt;&lt;br /&gt;This added test coverage has already caught a number of sneaky bugs (including a &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2593"&gt;rare index corruption case on disk-full&lt;/a&gt; and a &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2627"&gt;chunking bug in &lt;tt&gt;MMapDirectory&lt;/tt&gt;&lt;/a&gt;) that we otherwise would not have discovered for some time.&lt;br /&gt;&lt;br /&gt;The test infrastructure itself is so useful that it's now been &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2609"&gt;factored out as a standalone JAR&lt;/a&gt; so apps using Lucene can tap into it to create their own fun randomized tests.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-9016538941732218950?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/9016538941732218950/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://blog.mikemccandless.com/2011/03/your-test-cases-should-sometimes-fail.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/9016538941732218950'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/9016538941732218950'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/03/your-test-cases-should-sometimes-fail.html' title='Your test cases should sometimes fail!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5572735410631506151</id><published>2011-03-24T09:16:00.011-04:00</published><updated>2011-03-26T06:05:05.358-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Lucene's FuzzyQuery is 100 times faster in 4.0</title><content type='html'>There are many exciting improvements in Lucene's eventual 4.0 (trunk) release, but the awesome speedup to &lt;tt&gt;FuzzyQuery&lt;/tt&gt; really stands out, not only from its incredible gains but also because of the amazing behind-the-scenes story of how it all came to be.&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;FuzzyQuery&lt;/tt&gt; matches terms "close" to a specified base term: you specify an allowed maximum &lt;a href="http://en.wikipedia.org/wiki/Levenshtein_distance"&gt;edit distance&lt;/a&gt;, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.&lt;br /&gt;&lt;br /&gt;The &lt;tt&gt;QueryParser&lt;/tt&gt; syntax is &lt;tt&gt;term~&lt;/tt&gt; or 
&lt;tt&gt;term~N&lt;/tt&gt;, where &lt;tt&gt;N&lt;/tt&gt; is the maximum allowed number of edits (for older releases &lt;tt&gt;N&lt;/tt&gt; was a confusing float between &lt;tt&gt;0.0&lt;/tt&gt; and &lt;tt&gt;1.0&lt;/tt&gt;, which translates to an equivalent max edit distance through a tricky formula).&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;FuzzyQuery&lt;/tt&gt; is great for matching proper names: I can search for &lt;tt&gt;mcandless~1&lt;/tt&gt; and it will match &lt;tt&gt;mccandless&lt;/tt&gt; (insert &lt;tt&gt;c&lt;/tt&gt;), &lt;tt&gt;mcandles&lt;/tt&gt; (remove &lt;tt&gt;s&lt;/tt&gt;), &lt;tt&gt;mkandless&lt;/tt&gt; (replace &lt;tt&gt;c&lt;/tt&gt; with &lt;tt&gt;k&lt;/tt&gt;) and a great many other "close" terms.  With max edit distance 2 you can have up to 2 insertions, deletions or substitutions.  The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.&lt;br /&gt;&lt;br /&gt;Prior to 4.0, &lt;tt&gt;FuzzyQuery&lt;/tt&gt; took the simple yet horribly costly brute force approach: it visits every single unique term in the index, computes the edit distance for it, and accepts the term (and its documents) if the edit distance is low enough.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The journey begins&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The long journey began when &lt;a href="http://rcmuir.wordpress.com/"&gt;Robert Muir&lt;/a&gt; had the idea of pre-building a &lt;a href="http://en.wikipedia.org/wiki/Levenshtein_automaton"&gt;Levenshtein Automaton&lt;/a&gt;, a deterministic automaton (DFA) that accepts only the terms within edit distance &lt;tt&gt;N&lt;/tt&gt;.  Doing this, up front, and then intersecting that automaton with the terms in the index, should give a massive speedup, he reasoned.&lt;br /&gt;&lt;br /&gt;At first he built a simple prototype, explicitly unioning the separate DFAs that allow for up to &lt;tt&gt;N&lt;/tt&gt; insertions, deletions and substitutions.  
But, unfortunately, just building that DFA (let alone then intersecting it with the terms in the index), was too slow.&lt;br /&gt;&lt;br /&gt;Fortunately, after some Googling, he discovered &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652"&gt;a paper&lt;/a&gt;, by Klaus Schulz and Stoyan Mihov (now famous among the Lucene/Solr committers!) detailing an efficient algorithm for building the Levenshtein Automaton from a given base term and max edit distance.  All he had to do was code it up!  It's just software after all.  Somehow, he roped &lt;a href="http://www.lucidimagination.com/blog/author/markmiller"&gt;Mark Miller&lt;/a&gt;, another Lucene/Solr committer, into helping him do this.&lt;br /&gt;&lt;br /&gt;Unfortunately, the paper was nearly unintelligible!  It's 67 pages, filled with all sorts of equations, Greek symbols, definitions, propositions, lemmas, proofs. It uses scary concepts like Subsumption Triangles, along with beautiful yet still unintelligible diagrams.  Really the paper may as well have been written in Latin.&lt;br /&gt;&lt;br /&gt;Much coffee and beer was consumed, sometimes simultaneously.  Many hours were spent on IRC, staying up all night, with Mark and Robert carrying on long conversations, which none of the rest of us could understand, trying desperately to decode the paper and turn it into Java code.  Weeks went by like this and they actually had made some good initial progress, managing to loosely crack the paper to the point where they had a test implementation of the &lt;tt&gt;N=1&lt;/tt&gt; case, and it seemed to work.  But generalizing that to the &lt;tt&gt;N=2&lt;/tt&gt; case was... daunting.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The breakthrough&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Then, finally, a breakthrough!  
Robert found, after even more Googling, an existence proof, in an unexpected place: an open-source package, &lt;a href="http://sites.google.com/site/rrettesite/moman"&gt;Moman&lt;/a&gt;, under the generous &lt;a href="http://en.wikipedia.org/wiki/MIT_License"&gt;MIT license&lt;/a&gt;.  The author, &lt;a href="http://www.linkedin.com/pub/jean-philippe-barrette-lapierre/5/21/ab3"&gt;Jean-Philippe Barrette-LaPierre&lt;/a&gt;, had somehow, incredibly, magically, quietly, implemented the algorithm from this paper. And this was apparently a random side project for him, unrelated to his day job.  So now we knew it was possible (and we all have deep admiration for Jean-Philippe!).&lt;br /&gt;&lt;br /&gt;We decided to simply re-use Moman's implementation to accomplish our goals.  But, it turns out, its source code is all &lt;a href="http://www.python.org"&gt;Python&lt;/a&gt; (my favorite programming language)!  And, nearly as hairy as the paper itself.  Nevertheless, we pushed on.&lt;br /&gt;&lt;br /&gt;Not really understanding the Python code, nor the paper, we desperately tried to write our own Python code to tap into the various functions embedded in Moman's code, to auto-generate Java code containing the necessary tables for each max edit distance case (&lt;tt&gt;N=1&lt;/tt&gt;, &lt;tt&gt;N=2&lt;/tt&gt;, etc.).  
We had to guess what each Python function did, by its name, trying to roughly match this up to the spooky terminology in the paper.&lt;br /&gt;&lt;br /&gt;The result was &lt;a href="https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/createLevAutomata.py"&gt;createLevAutomata.py&lt;/a&gt;: it auto-generates crazy looking Java code (see &lt;a href="https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java"&gt;Lev2ParametricDescription.java&lt;/a&gt;, and scroll to the cryptic packed tables at the bottom), which in turn is used by &lt;a href="https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java"&gt;further Java code&lt;/a&gt; to create the Levenshtein automaton per-query.  We only generate the &lt;tt&gt;N=1&lt;/tt&gt; and &lt;tt&gt;N=2&lt;/tt&gt; cases (the &lt;tt&gt;N&amp;gt;=3&lt;/tt&gt; cases aren't really practical, at least not yet).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The last bug...&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Realize, now, what a crazy position we were in.  We wrote our own scary Python code, tapping into various functions in the Moman package, to auto-generate unreadable Java code with big tables of numbers, which is then used to generate Levenshtein automata from the base term and &lt;tt&gt;N&lt;/tt&gt;.  We went through many iterations with this crazy chain of Python and Java code that we barely understood, slowly iterating to get the bugs out.&lt;br /&gt;&lt;br /&gt;After fixing many problems, we still had one persistent bug which we just couldn't understand, let alone fix.  We struggled for several days, assuming the bug was in our crazy Python/Java chain. Finally, we considered the possibility that the bug was in Moman, and indeed Robert managed to reduce the problem to a tiny Python-only case showing where Moman failed to match the right terms.  
Robert sent this example to Jean-Philippe, who quickly confirmed the bug and posted &lt;a href="http://groups.google.com/group/moman/browse_thread/thread/16c09d659242c142"&gt;a patch&lt;/a&gt; the next day.  We applied his patch and suddenly everything was working perfectly!&lt;br /&gt;&lt;br /&gt;Fortunately, while this fast &lt;tt&gt;FuzzyQuery&lt;/tt&gt; was unbelievably hairy to implement, testing it well is relatively easy since we can validate it against the brute-force enumeration from &lt;tt&gt;3.0&lt;/tt&gt;.  We have several tests verifying the different layers executed by the full &lt;tt&gt;FuzzyQuery&lt;/tt&gt;.  The tests are exhaustive in that they test all structurally different cases possible in the Levenshtein construction, using binary (only characters &lt;tt&gt;0&lt;/tt&gt; and &lt;tt&gt;1&lt;/tt&gt;) terms.&lt;br /&gt;&lt;br /&gt;Beyond just solving this nearly impossible task of efficiently compiling a term to a Levenshtein Automaton, we had many other parts to fill in.  For example, Robert separately created a general &lt;tt&gt;AutomatonQuery&lt;/tt&gt;, re-using infrastructure from the open-source &lt;a href="http://www.brics.dk/automaton/"&gt;Brics&lt;/a&gt; automaton package, to enable fast intersection of an automaton against all terms and documents in the index. This query is now used to handle &lt;tt&gt;WildcardQuery&lt;/tt&gt;, &lt;tt&gt;RegexpQuery&lt;/tt&gt;, and &lt;tt&gt;FuzzyQuery&lt;/tt&gt;.  It's also useful for custom cases; for example it's used by &lt;a href="http://lucene.apache.org/solr/"&gt;Solr&lt;/a&gt; to reverse wildcard queries.  
&lt;a href="http://www.slideshare.net/otisg/finite-state-queries-in-lucene"&gt;These slides from Robert&lt;/a&gt; describe &lt;tt&gt;AutomatonQuery&lt;/tt&gt;, and its fun possible use cases, in more detail.&lt;br /&gt;&lt;br /&gt;Separately, we had an impedance mismatch: these automatons speak full Unicode (&lt;tt&gt;UTF32&lt;/tt&gt;) characters, yet Lucene's terms are stored in &lt;tt&gt;UTF8&lt;/tt&gt; bytes, so we had to create a &lt;tt&gt;UTF32 -&gt; UTF8&lt;/tt&gt; automaton converter, which by itself was also very hairy!  That converter translates any &lt;tt&gt;UTF32&lt;/tt&gt; automaton into an equivalent &lt;tt&gt;UTF8&lt;/tt&gt; automaton, which can be directly intersected against the terms in the index.&lt;br /&gt;&lt;br /&gt;So, today, when you run a &lt;tt&gt;FuzzyQuery&lt;/tt&gt; in 4.0, it efficiently seeks and scans only those regions of the term space which may have matches, guided by the Levenshtein automaton.  This, coupled with ongoing performance improvements to seeking and scanning terms, as well as other major improvements like &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2690"&gt;performing MultiTermQuery rewrites per-segment&lt;/a&gt;, has given us the astounding overall gains in &lt;tt&gt;FuzzyQuery&lt;/tt&gt;.&lt;br /&gt;&lt;br /&gt;Thanks to these enormous performance improvements, Robert has created an entirely &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2507"&gt;new automaton spell checker&lt;/a&gt; that uses this same algorithm to find candidate terms for respelling.  This is just like &lt;tt&gt;FuzzyQuery&lt;/tt&gt;, except it doesn't visit the matching documents. 
This is a big improvement over the &lt;a href="http://lucene.apache.org/java/3_0_1/api/contrib-spellchecker/index.html"&gt;existing spellchecker&lt;/a&gt; as it does not require a separate spellchecker index be maintained.&lt;br /&gt;&lt;br /&gt;This whole exciting experience is a great example of why open-source development works so well.  Here we have diverse committers from Lucene/Solr, bringing their various unusual strengths (automatons, Unicode, Python, etc.) to bear on an insanely hard challenge, leveraging other potent open-source packages including Moman and Brics, iterating with the authors of these packages to resolve bugs.  No single person involved in this really understands all of the parts; it's truly a team effort.&lt;br /&gt;&lt;br /&gt;And now you know what's going on under the hood when you see incredible speedups with &lt;tt&gt;FuzzyQuery&lt;/tt&gt; in 4.0!&lt;br /&gt;&lt;br /&gt;[For the not-faint-of-heart, you can browse &lt;a href="https://issues.apache.org/jira/browse/LUCENE-1606"&gt;LUCENE-1606&lt;/a&gt; to see parts of this story unfolding through Jira]&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5572735410631506151?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5572735410631506151/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html#comment-form' title='35 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5572735410631506151'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5572735410631506151'/><link rel='alternate' type='text/html' 
href='http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html' title='Lucene&apos;s FuzzyQuery is 100 times faster in 4.0'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>35</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-2463937299427783976</id><published>2011-03-11T15:25:00.004-05:00</published><updated>2011-03-30T13:46:58.775-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Hack on Lucene this summer!</title><content type='html'>Are you a student?  Looking to do some fun coding this summer?  Then join us for the 2011 &lt;a href="http://wiki.apache.org/lucene-java/SummerOfCode2011"&gt;Google Summer of Code&lt;/a&gt;!&lt;br /&gt;&lt;br /&gt;The application deadline is in &lt;a href="http://www.google-melange.com/document/show/gsoc_program/google/gsoc2011/faqs#timeline"&gt;less than a month&lt;/a&gt;!  
Lucene has these &lt;a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;jqlQuery=labels+%3D+lucene-gsoc-11"&gt;initial potential projects&lt;/a&gt; identified, but you can also pick your own; just be sure to discuss with the community first (send an email to &lt;a href="mailto:dev@lucene.apache.org"&gt;dev@lucene.apache.org&lt;/a&gt;).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-2463937299427783976?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/2463937299427783976/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/03/hack-on-lucene-this-summer.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2463937299427783976'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2463937299427783976'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/03/hack-on-lucene-this-summer.html' title='Hack on Lucene this summer!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-2515221350987474493</id><published>2011-02-12T11:35:00.007-05:00</published><updated>2011-02-12T12:45:29.647-05:00</updated><title type='text'>So many icicles</title><content type='html'>It's that cold time of year again -- that's right, &lt;a 
href="http://en.wikipedia.org/wiki/Winter"&gt;winter&lt;/a&gt;! We have lots of snow, freezing temperatures, and... incredible &lt;a href="http://en.wikipedia.org/wiki/Icicle"&gt;icicles&lt;/a&gt; hanging off houses:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-eFTmeUabVB0/TVa3Lr5R7KI/AAAAAAAAAFs/sj43J-rVTCg/s1600/5.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 389px;" src="http://4.bp.blogspot.com/-eFTmeUabVB0/TVa3Lr5R7KI/AAAAAAAAAFs/sj43J-rVTCg/s1600/5.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5572843000306986146" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Conditions have to be just right for these icicles to form.  First, you need lots of snow accumulated on roofs.  Second, you need below-freezing temperatures for many days in a row.  Finally, of course, you need a house that loses lots of heat through its attic/roof.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-i_9o4ifSbzE/TVa4J2HQ7KI/AAAAAAAAAF0/ZPTv77JSeHg/s1600/3.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 600px; height: 387;" src="http://4.bp.blogspot.com/-i_9o4ifSbzE/TVa4J2HQ7KI/AAAAAAAAAF0/ZPTv77JSeHg/s1600/3.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5572844068201884834" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;These icicles are an easy way to spot houses that waste heat, something that's otherwise not normally easy to detect!  
In the winter, here in New England, such houses stand out very clearly.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-YJddixPZam8/TVa4dnPl4qI/AAAAAAAAAF8/8qqSQLib7ic/s1600/1.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 548px; height: 600px;" src="http://4.bp.blogspot.com/-YJddixPZam8/TVa4dnPl4qI/AAAAAAAAAF8/8qqSQLib7ic/s1600/1.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5572844407807664802" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Of course, roof snow does also melt for legitimate reasons.  For example, when the temperature outside moves above freezing, the snow will melt.  However, the resulting water simply falls off the roof. It's only when ambient temperature is below freezing, yet the snow is still being melted (from waste heat leaking through the roof) that you get immense icicles.  The water dribbles down the roof, under the snow, and upon hitting the roof's edge / gutter overhang, which is not wasting heat, it freezes.&lt;br /&gt;&lt;br /&gt;Given enough snow, wasted heat, below freezing temperatures, and time, you'll get amazing icicles.  I can understand that older homes will have poorer insulation and thus waste heat: standards were more lax back then, and we generally were not as environmentally conscious as we are today.  
But, when I see massive icicles on new construction, it's disappointing:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-FmmZRIgxR_M/TVa4n6XBi3I/AAAAAAAAAGE/LwbKkyYgGsM/s1600/2.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 504px; height: 600px;" src="http://1.bp.blogspot.com/-FmmZRIgxR_M/TVa4n6XBi3I/AAAAAAAAAGE/LwbKkyYgGsM/s1600/2.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5572844584737803122" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;These icicles are very firmly attached to the roof/gutter, since Mother Nature carefully froze them there one drop at a time.  However, as pretty as they are, these so-called &lt;a href="http://en.wikipedia.org/wiki/Ice_dam"&gt;ice dams&lt;/a&gt; can do massive damage.  There are all sorts of products and services out there to try to make them go away.  But the best solution, of course, is to simply prevent them from developing in the first place, by addressing the root cause: stop wasting heat.  
You can add insulation to your attic and/or turn down the thermostat.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-2515221350987474493?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/2515221350987474493/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/02/so-many-icicles.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2515221350987474493'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2515221350987474493'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/02/so-many-icicles.html' title='So many icicles'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-eFTmeUabVB0/TVa3Lr5R7KI/AAAAAAAAAFs/sj43J-rVTCg/s72-c/5.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7619229787096170890</id><published>2011-02-11T18:46:00.002-05:00</published><updated>2011-02-12T10:52:11.642-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Visualizing Lucene's segment merges</title><content type='html'>If you've ever wondered how Lucene picks segments to merge during indexing, it looks something like this:&lt;br /&gt;&lt;br /&gt;&lt;iframe 
title="YouTube video player" width="560" height="349" src="http://www.youtube.com/embed/YW0bOvLp72E?rel=0" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;That video displays segment merges while indexing the entire &lt;a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download"&gt;Wikipedia (English) export&lt;/a&gt; (29 GB plain text), played back at ~8X real-time.&lt;br /&gt;&lt;br /&gt;Each segment is a bar, whose height is the size (in MB) of the segment (log-scale).  Segments on the left are largest; as new segments are flushed, they appear on the right.  Segments being merged are colored the same color and, once the merge finishes, are removed and replaced with the new (larger) segment.  You can see the nice logarithmic staircase pattern that merging creates.&lt;br /&gt;&lt;br /&gt;By default, using &lt;tt&gt;ConcurrentMergeScheduler&lt;/tt&gt;, Lucene executes each merge in a separate thread, allowing multiple merges to run at once without blocking ongoing indexing.  The bigger the merge the longer it takes to finish.&lt;br /&gt;&lt;br /&gt;One simple metric you can use to measure overall merge cost is to divide the total number of bytes read/written for all merging by the final byte size of an index; smaller values are better. This is analogous to the &lt;a href="http://en.wikipedia.org/wiki/Write_amplification"&gt;write amplification&lt;/a&gt; measure that solid-state disks use, in that your app has written X MB but because of merging and deleted documents overhead, Lucene had to internally read and write some multiple of X.  You can think of this write amplification as a tax on your indexing; you don't pay this tax up front, when the document is first indexed, but only later as you continue to add documents to the index.  
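&lt;br /&gt;&lt;br /&gt;Here's a minimal sketch of that calculation in Python.  The function name and the sample numbers are hypothetical, chosen only for illustration; nothing here is a Lucene API:&lt;br /&gt;&lt;pre&gt;
```python
def write_amplification(bytes_merged, final_index_bytes):
    """Total bytes copied by merging, divided by the final index size.

    Both arguments must use the same unit (bytes, MB, GB, ...).
    """
    if final_index_bytes == 0:
        raise ValueError("final index size must be non-zero")
    return bytes_merged / final_index_bytes


# Hypothetical run: 9.0 GB copied during merging, 2.0 GB final index.
print(write_amplification(9.0, 2.0))  # prints 4.5
```
&lt;/pre&gt;Smaller results are better: a value of 4.5 would mean that merging copied four and a half times as many bytes as the finished index holds.&lt;br /&gt;&lt;br /&gt;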
The video shows the total size of the index as well as net bytes merged, so it's easy to compute write amplification for the above run: 6.19 (final index size was 10.87 GB and net bytes copied during merging was 67.30 GB).&lt;br /&gt;&lt;br /&gt;Proper merge selection is actually a tricky problem, in general, because we must carefully balance not burning CPU/IO (due to inefficient merge choices), while also not allowing too many segments to accumulate in the index, as this slows down search performance.  To minimize merge cost, you ideally would merge only equal-sized segments, and merge a larger number of segments at a time.&lt;br /&gt;&lt;br /&gt;In fact, from the viewpoint of the &lt;tt&gt;MergePolicy&lt;/tt&gt;, this is really a game against a sneaky opponent who randomly makes sudden changes to the index, such as flushing new segments or applying new deletions.  If the opponent is well behaved, it'll add equal-sized, large segments, which are easy to merge well, as was the case in the above video; but that's a really easy game, like playing tic-tac-toe against a 3-year-old.&lt;br /&gt;&lt;br /&gt;Facing this next opponent is more like playing a game of chess:&lt;br /&gt;&lt;br /&gt;&lt;iframe title="YouTube video player" width="640" height="390" src="http://www.youtube.com/embed/ojcpvIY3QgA?rel=0" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;No more nice looking staircase!  This test shows the more challenging near-real-time use case, which calls &lt;tt&gt;updateDocument&lt;/tt&gt; (= delete + add) at a high rate and frequently opens a new reader (creating a new segment each time).  The dark gray band on top of each segment shows the proportion of deletions in that segment.  When you delete a document in Lucene, the bytes consumed by that document are not reclaimed until the segment is merged, and you can see old segments being eroded as new segments are appended to the index. 
Unfortunately, Lucene's current default &lt;tt&gt;LogByteSizeMergePolicy&lt;/tt&gt; struggles to pick good merges against this opponent, often merging irregularly sized segments.&lt;br /&gt;&lt;br /&gt;The big issue with &lt;tt&gt;LogByteSizeMergePolicy&lt;/tt&gt; is that it must pick adjacent segments for merging.  However, we recently &lt;a href="https://issues.apache.org/jira/browse/LUCENE-1076"&gt;relaxed this longstanding limitation&lt;/a&gt;, and I'm working on a new merge policy, &lt;tt&gt;TieredMergePolicy&lt;/tt&gt; (currently a patch on &lt;a href="https://issues.apache.org/jira/browse/LUCENE-854"&gt;LUCENE-854&lt;/a&gt;) to take advantage of this. &lt;tt&gt;TieredMergePolicy&lt;/tt&gt; also fixes some other limitations of &lt;tt&gt;LogByteSizeMergePolicy&lt;/tt&gt;, such as merge cascading that results in occasionally "inadvertently optimizing" the index as well as the overly coarse control it offers over the maximum segment size.&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;TieredMergePolicy&lt;/tt&gt; first computes the allowed "budget" of how many segments should be in the index, by counting how many steps the "perfect logarithmic staircase" would require given total index size, minimum segment size (floored), &lt;tt&gt;mergeAtOnce&lt;/tt&gt;, and a new configuration &lt;tt&gt;maxSegmentsPerTier&lt;/tt&gt; that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be.&lt;br /&gt;&lt;br /&gt;Whenever the index is over-budget, it selects the best merge. Potential merges are scored with a combination of skew (basically how "equally sized" the segments being merged are), total size (smaller merges are favored), and how many deleted documents will be reclaimed.  
It also tries to merge to the exact maximum segment size (default 5GB).&lt;br /&gt;&lt;br /&gt;Here's the same difficult near-real-time test, this time using &lt;tt&gt;TieredMergePolicy&lt;/tt&gt; instead:&lt;br /&gt;&lt;br /&gt;&lt;iframe title="YouTube video player" width="640" height="390" src="http://www.youtube.com/embed/YOklKW9LJNY?rel=0" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;Note how the segments are now sorted by size, since &lt;tt&gt;TieredMergePolicy&lt;/tt&gt; is allowed to merge non-adjacent segments.  For the above difficult run, the write amplification for Lucene's current default merge policy (&lt;tt&gt;LogByteSizeMergePolicy&lt;/tt&gt;) was 14.49 while the new merge policy (&lt;tt&gt;TieredMergePolicy&lt;/tt&gt;) was 13.64, a nice improvement, though not as much as I was expecting.  I suspect this is because &lt;tt&gt;TieredMergePolicy&lt;/tt&gt; works hard to hit the max segment size (5 GB), resulting in 6 maximum sized segments while &lt;tt&gt;LogByteSizeMergePolicy&lt;/tt&gt; had only 3.  These numbers are much higher than the 6.19 write amplification from the "easy" merging, since that merging was about as efficient as we can hope for.&lt;br /&gt;&lt;br /&gt;While &lt;tt&gt;TieredMergePolicy&lt;/tt&gt; is a good improvement over &lt;tt&gt;LogByteSizeMergePolicy&lt;/tt&gt;, it's still theoretically possible to do even better!  In particular, &lt;tt&gt;TieredMergePolicy&lt;/tt&gt; is greedy in its decision making: it only looks statically at the index, as it exists right now, and always chooses what looks like the best merge, not taking into account how this merge will affect future merges nor what further changes the opponent is likely to make to the index.  This is good, but it's not guaranteed to produce the optimal merge sequence.  For any series of changes made by the opponent there is necessarily a corresponding perfect sequence of merges, that minimizes net merge cost while obeying the budget.  
If instead the merge policy used a search with some lookahead, such as the &lt;a href="http://en.wikipedia.org/wiki/Minimax"&gt;Minimax algorithm&lt;/a&gt;, it could do a better job setting up more efficient future merges. I suspect this theoretical gain is likely small in practice; but if there are any game theorists out there reading this now, I'd love to be proven wrong!&lt;br /&gt;&lt;br /&gt;I generated these movies with &lt;a href="http://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/mergeViz.py"&gt;this simple Python script&lt;/a&gt;. It parses the &lt;tt&gt;infoStream&lt;/tt&gt; output from &lt;tt&gt;IndexWriter&lt;/tt&gt;, renders one frame at a time, saved as a PNG file in the local file system, using the &lt;a href="http://www.pythonware.com/products/pil"&gt;Python Imaging Library&lt;/a&gt;, and finally encodes all frames into a video using &lt;a href="http://www.pythonware.com/products/pil/"&gt;MEncoder&lt;/a&gt; with the &lt;a href="http://www.videolan.org/developers/x264.html"&gt;X264&lt;/a&gt; codec.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7619229787096170890?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7619229787096170890/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html#comment-form' title='20 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7619229787096170890'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7619229787096170890'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html' title='Visualizing 
Lucene&apos;s segment merges'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://img.youtube.com/vi/YW0bOvLp72E/default.jpg' height='72' width='72'/><thr:total>20</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-8583646636049186013</id><published>2011-01-08T06:42:00.002-05:00</published><updated>2011-01-08T07:12:41.957-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Finite State Transducers, Part 2</title><content type='html'>In my &lt;a href="http://chbits.blogspot.com/2010/12/using-finite-state-transducers-in.html"&gt;last post&lt;/a&gt;, I described a &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698"&gt;cool incremental algorithm for building an FST&lt;/a&gt; from pre-sorted input/output pairs, and how we're folding it into Lucene.&lt;br /&gt;&lt;br /&gt;Progress!  The patch on &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2792"&gt;LUCENE-2792&lt;/a&gt; is now committed to Lucene's trunk (eventually 4.0), so that we now use an FST to hold all terms in RAM for the &lt;a href="http://chbits.blogspot.com/2010/10/lucenes-simpletext-codec.html"&gt;SimpleText&lt;/a&gt; codec.&lt;br /&gt;&lt;br /&gt;This was a great step forward, and it added the raw low-level infrastructure for building and using FSTs.  But, it's really just a toy usage, since the SimpleText codec is not for production use.&lt;br /&gt;&lt;br /&gt;I'm happy to report even more progress: we finally have a "real" usage for FSTs in Lucene!  
With &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2843"&gt;LUCENE-2843&lt;/a&gt; (now committed), we use an FST to hold the terms index in RAM.&lt;br /&gt;&lt;br /&gt;Many operations in Lucene require looking up metadata for a requested term.  This information includes the number of documents containing the term, file pointers into postings files that actually store the docIDs and positions, etc. Because there could be many unique terms, all this term metadata resides on-disk in the terms dictionary file (&lt;tt&gt;_X.tis&lt;/tt&gt;).&lt;br /&gt;&lt;br /&gt;Looking up a term is then a two-step process.  First, we consult the terms index, which resides entirely in RAM, to map the requested term to a file pointer in the terms dictionary file. Second, we scan the terms in the on-disk terms dictionary file starting from that file pointer, to look for an exact match.&lt;br /&gt;&lt;br /&gt;Certain queries, such as &lt;tt&gt;TermQuery&lt;/tt&gt;, need only one exact term lookup. Others, such as &lt;tt&gt;FuzzyQuery&lt;/tt&gt; or the new, very cool &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2507"&gt;automaton spellchecker&lt;/a&gt;, which does not require a separate spellchecker index, perform many term lookups (potentially millions).&lt;br /&gt;&lt;br /&gt;Before &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2843"&gt;LUCENE-2843&lt;/a&gt;, in trunk, the terms index used packed &lt;tt&gt;byte[]&lt;/tt&gt; and &lt;tt&gt;int[]&lt;/tt&gt; arrays.  In fact, this was already a &lt;a href="http://chbits.blogspot.com/2010/07/lucenes-ram-usage-for-searching.html"&gt;huge improvement over the terms index in 3.0&lt;/a&gt;, but is still more wasteful than an FST since each term was stored in fully expanded form, while the FST shares common prefixes and suffixes.  
Getting this working also required adding separate &lt;tt&gt;seekFloor&lt;/tt&gt; and &lt;tt&gt;seekCeil&lt;/tt&gt; methods to the &lt;tt&gt;FSTEnum&lt;/tt&gt; classes (like a &lt;tt&gt;SortedMap&amp;lt;T&amp;gt;&lt;/tt&gt;).&lt;br /&gt;&lt;br /&gt;To test this, I indexed the first 10 million 1 KB documents derived from &lt;a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download"&gt;Wikipedia's English database download&lt;/a&gt;.  The RAM required for the FST-based terms index was ~38% to 52% smaller (larger segments see more gains, as the FST "scales up" well).  Not only is the RAM required much lower, but term lookups are also faster: the &lt;tt&gt;FuzzyQuery&lt;/tt&gt; &lt;tt&gt;united~2&lt;/tt&gt; was ~22% faster.&lt;br /&gt;&lt;br /&gt;This is very much win/win, so we've made this terms index the default for all core codecs (the terms dictionary and terms index are pluggable, so it's easy for other codecs to use this as well).&lt;br /&gt;&lt;br /&gt;There are many other ways we can use FSTs in Lucene, and we've only just scratched the surface here.  In fact, FSTs offer such a sizable RAM reduction that I think for many, but not all, apps it'd be realistic to avoid the two-step term lookup process entirely and simply hold the entire terms dictionary in RAM.  
This should make term lookup intensive queries potentially much faster, though we'd likely have to rework them to use different algorithms optimized for iterating directly through the terms as an FST.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-8583646636049186013?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/8583646636049186013/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2011/01/finite-state-transducers-part-2.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8583646636049186013'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8583646636049186013'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2011/01/finite-state-transducers-part-2.html' title='Finite State Transducers, Part 2'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-6016018046456427862</id><published>2010-12-03T11:40:00.009-05:00</published><updated>2010-12-07T13:36:58.112-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Using Finite State Transducers in Lucene</title><content type='html'>&lt;a href="http://en.wikipedia.org/wiki/Finite_state_transducer"&gt;FSTs&lt;/a&gt; are finite-state machines that map a term (byte sequence) 
to an arbitrary output.  They also look cool:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4pUbN9gxhUI/TPk21wErb9I/AAAAAAAAAFM/dhPcsyo3KV4/s1600/FSTExample.png"&gt;&lt;img style="cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4pUbN9gxhUI/TPk21wErb9I/AAAAAAAAAFM/dhPcsyo3KV4/s400/FSTExample.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5546524713148968914" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;That FST maps the sorted words &lt;i&gt;mop&lt;/i&gt;, &lt;i&gt;moth&lt;/i&gt;, &lt;i&gt;pop&lt;/i&gt;, &lt;i&gt;star&lt;/i&gt;, &lt;i&gt;stop&lt;/i&gt; and &lt;i&gt;top&lt;/i&gt;  to their ordinal number (0, 1, 2, ...).  As you traverse the arcs, you sum up the outputs, so &lt;i&gt;stop&lt;/i&gt; hits 3 on the &lt;b&gt;s&lt;/b&gt; and 1 on the &lt;b&gt;o&lt;/b&gt;, so its output ordinal is 4.  The outputs can be arbitrary numbers or byte sequences, or combinations, etc. -- it's pluggable.&lt;br /&gt;&lt;br /&gt;Essentially, an FST is a SortedMap&amp;lt;ByteSequence,SomeOutput&amp;gt;, if the arcs are in sorted order.  With the right representation, it requires far less RAM than other SortedMap implementations, but has a higher CPU cost during lookup.  The low memory footprint is vital for Lucene since an index can easily have many millions (sometimes, billions!) of unique terms.&lt;br /&gt;&lt;br /&gt;There's a &lt;a href="http://www.cs.nyu.edu/~mohri/pub/fla.pdf"&gt;great deal of  theory&lt;/a&gt; behind FSTs.  They generally support the same operations as &lt;a href="http://en.wikipedia.org/wiki/Finite-state_machine"&gt;FSMs&lt;/a&gt; (determinize, minimize, union, intersect, etc.).  
You can also compose them, where the outputs of one FST are intersected with the inputs of the next, resulting in a new FST.&lt;br /&gt;&lt;br /&gt;There are some nice general-purpose FST toolkits (&lt;a href="http://www.openfst.org/"&gt;OpenFst&lt;/a&gt; looks great) that support all these operations, but for Lucene I decided to implement &lt;a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698"&gt;this neat algorithm&lt;/a&gt;, which incrementally builds up the minimal unweighted FST from pre-sorted inputs.  This is a perfect fit for Lucene since we already store all our terms in sorted (Unicode) order.&lt;br /&gt;&lt;br /&gt;The resulting implementation (currently a patch on &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2792"&gt;LUCENE-2792&lt;/a&gt;) is fast and memory-efficient: it builds the FST for the 9.8 million terms in a 10 million document Wikipedia index in ~8 seconds (on a fast computer), requiring less than 256 MB heap.  The resulting FST is 69 MB. It can also build a &lt;a href="http://en.wikipedia.org/wiki/Trie"&gt;prefix trie&lt;/a&gt;, pruning by how many terms come through each node, with even less memory.&lt;br /&gt;&lt;br /&gt;Note that because &lt;a href="http://en.wikipedia.org/wiki/Commutativity"&gt;addition is commutative&lt;/a&gt;, an FST with numeric outputs is not guaranteed to be minimal in my implementation; perhaps if I could generalize the algorithm to a weighted FST instead, which also stores a weight on each arc, that would yield the minimal FST.  
But I don't expect this will be a problem in practice for Lucene.&lt;br /&gt;&lt;br /&gt;In the patch I modified the &lt;a href="http://chbits.blogspot.com/2010/10/lucenes-simpletext-codec.html"&gt;SimpleText&lt;/a&gt; codec, which was loading all terms into a TreeMap mapping the BytesRef term to an int docFreq and long filePointer, to use an FST instead, and all tests pass!&lt;br /&gt;&lt;br /&gt;There are lots of other potential places in Lucene where we could use FSTs, since we often need to map the index terms to "something".  For example, the terms index maps to a long file position; the field cache maps to ordinals; the terms dictionary maps to codec-specific metadata, etc.  We also have multi-term queries (e.g. Prefix, Wildcard, Fuzzy, Regexp) that need to test a large number of terms, which could work directly via intersection with the FST instead (many apps could easily fit their entire terms dict in RAM as an FST since the format is so compact).  The FST could be used for a key/value store.  
Lots of fun things to try!&lt;br /&gt;&lt;br /&gt;Many thanks to &lt;a href="http://www.cs.put.poznan.pl/dweiss/"&gt;Dawid Weiss&lt;/a&gt; for helping me iterate on this.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-6016018046456427862?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/6016018046456427862/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6016018046456427862'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6016018046456427862'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html' title='Using Finite State Transducers in Lucene'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_4pUbN9gxhUI/TPk21wErb9I/AAAAAAAAAFM/dhPcsyo3KV4/s72-c/FSTExample.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7892088929416242923</id><published>2010-11-03T13:41:00.003-04:00</published><updated>2010-11-03T14:07:04.432-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Health'/><title type='text'>Big Fat Fiasco</title><content type='html'>I 
just watched &lt;a href="http://www.fathead-movie.com/index.php/2010/10/28/video-of-the-big-fat-fiasco-speech/"&gt;this talk by Tom Naughton&lt;/a&gt;, describing the mis-steps and bad science over the years that have tricked us all into believing we should avoid fat and cholesterol in our diet when in fact we should be doing the exact opposite!&lt;br /&gt;&lt;br /&gt;Tom is an excellent speaker (he's a comedian!), mixing in humor in what is at heart a very sad topic.  He details the science that should have led us to conclude that excessive carbs and sugar consumption are in fact the root cause behind heart disease, diabetes and obesity, not fat and cholesterol as we've been incessantly told over the years.&lt;br /&gt;&lt;br /&gt;He shows how the "Fat Lazy Slob Theory" (also called "calories in minus calories out") so frequently put forth to explain weight gain is in fact wrong, and that instead the root cause is the biochemistry in your body driving you to eat when your blood sugar is too high.&lt;br /&gt;&lt;br /&gt;Tom's &lt;a href="http://www.amazon.com/dp/B001NRY6R2?tag=fatheadmoviec-20&amp;camp=14573&amp;creative=327641&amp;linkCode=as1&amp;creativeASIN=B001NRY6R2&amp;adid=1V3CT0XHH4P2RPBKRGZF&amp;"&gt;documentary movie Fat Head&lt;/a&gt;, which is getting great reviews on Amazon, delves into these same topics.  I haven't watched it yet but I plan to.&lt;br /&gt;&lt;br /&gt;So enjoy your butter, cheese, whole milk, eggs (with yolk!!) and cut back on sugars and carbs.  
And don't drink soda!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7892088929416242923?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7892088929416242923/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/11/big-fat-fiasco.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7892088929416242923'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7892088929416242923'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/11/big-fat-fiasco.html' title='Big Fat Fiasco'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7430210650274432964</id><published>2010-10-25T16:38:00.002-04:00</published><updated>2010-10-25T16:53:50.099-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Health'/><title type='text'>Our medical system is a house of cards</title><content type='html'>I just came across this &lt;a href="http://www.theatlantic.com/magazine/archive/2010/11/lies-damned-lies-and-medical-science/8269/2/"&gt;great article&lt;/a&gt; about meta-researcher Dr. John Ioannidis.  Here's the summary:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Much of what medical researchers conclude in their studies is misleading, exaggerated, or flat-out wrong. 
So why are doctors—to a striking extent—still drawing upon misinformation in their everyday practice? Dr. John Ioannidis has spent his career challenging his peers by exposing their bad science.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The gist is that modern medical research is deeply flawed and biased such that the "conclusions" that you and I eventually read in the news headlines are often false.  I especially love his advice for us all:&lt;br /&gt;&lt;br /&gt;  &lt;i&gt;Ioannidis suggests a simple approach: ignore them all&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;This is in fact my approach!  I have a simple rule: if it tastes good it's good for you.  So I eat plenty of fat, salt, sugar, cholesterol, carbs, etc.  I love eggs and cheese and I always avoid low-fat or low-cholesterol foods.  I get lots of sun and never use sun screen.  I drink coffee and beer, daily.  I drink lots of water.  I get daily exercise, running and walking.  And I avoid hand sanitizers like Purell (I believe commonplace dirt/germs are in fact natural and good for you).  I strongly believe humans do not need pills to stay healthy.  I don't take a daily vitamin.  And I'm very healthy!&lt;br /&gt;&lt;br /&gt;This &lt;a href="http://discovermagazine.com/2005/nov/dialogue-abramson"&gt;short interview&lt;/a&gt; between Discover Magazine and Harvard clinician John Abramson echoes the same core problem.  Here's a choice quote:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;When you look at the highest quality medical studies, the odds that a study will favor the use of a new drug are 5.3 times higher for commercially funded studies than for noncommercially funded studies.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Unfortunately, the medical world has a deep, deep conflict of interest: healthy people do not generate profits.  
Capitalism is a horrible match to health care.&lt;br /&gt;&lt;br /&gt;So, next time your doctor prescribes a fancy new cool-sounding powerful drug like Alevia or Omosia or Nanotomopia or whatever, try to remember that our medical system is really built on a house of cards. Your doctor, let alone you, cannot possibly differentiate what's true from what's false.  Don't trust that large triple-blind random controlled trial that supposedly validated this cool new drug.  You are the guinea pig!  And it's only when these drugs cause all sorts of problems once they are really tested on the population at large that their true colors are revealed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7430210650274432964?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7430210650274432964/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/10/our-medical-system-is-house-of-cards.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7430210650274432964'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7430210650274432964'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/10/our-medical-system-is-house-of-cards.html' title='Our medical system is a house of cards'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-9208491285884910380</id><published>2010-10-17T14:22:00.003-04:00</published><updated>2010-10-25T16:55:10.202-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Pics from BBQ after Lucene Revolution</title><content type='html'>I finally pulled the pics off my camera from last week's BBQ after Lucene Revolution in Boston, where much fun was had!  See them &lt;a href="http://picasaweb.google.com/mikemccand/BBQLuceneRevolutionOct2010"&gt;here&lt;/a&gt;.  It was awesome to finally meet everyone!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-9208491285884910380?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/9208491285884910380/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/10/pics-from-bbq-after-lucene-revolution.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/9208491285884910380'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/9208491285884910380'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/10/pics-from-bbq-after-lucene-revolution.html' title='Pics from BBQ after Lucene Revolution'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-2852573646272425537</id><published>2010-10-09T14:20:00.003-04:00</published><updated>2010-10-10T14:41:18.379-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Fun with flexible indexing</title><content type='html'>The &lt;a href="http://lucenerevolution.com"&gt;Lucene Revolution&lt;/a&gt; conference just wrapped up yesterday.  It was well attended (~300 or so people).  It was great fun to hear about all the diverse ways that Lucene and Solr are being used in the real world.&lt;br /&gt;&lt;br /&gt;I gave a talk about flexible indexing, coming in the next major release of Lucene (4.0).  Slides are &lt;a href="http://www.box.net/shared/mqhoyz5jv7"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-2852573646272425537?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/2852573646272425537/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/10/fun-with-flexible-indexing.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2852573646272425537'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2852573646272425537'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/10/fun-with-flexible-indexing.html' title='Fun with flexible indexing'/><author><name>Mike 
McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5556522128874720413</id><published>2010-10-05T19:27:00.003-04:00</published><updated>2010-10-05T19:36:39.929-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Lucene's SimpleText codec</title><content type='html'>Inspired by &lt;a href="http://www.lucidimagination.com/search/document/b68846e383824653/how_to_export_lucene_index_to_a_simple_text_file#b68846e383824653"&gt;this question&lt;/a&gt; on the Lucene user's list, I created a new codec in Lucene called the SimpleText codec.  The best ideas come from the user's lists!&lt;br /&gt;&lt;br /&gt;This is of course only available in Lucene's current trunk, to be eventually released as the next major release (4.0).  
Flexible indexing makes it easy to swap in different codecs to do the actual writing and reading of postings data to/from the index, and we have several fun codecs already available and more on the way...&lt;br /&gt;&lt;br /&gt;Unlike all other codecs, which save the postings data in compact binary files, this codec writes all postings to a single human-readable text file, like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;field contents&lt;br /&gt;  term file&lt;br /&gt;    doc 0&lt;br /&gt;      pos 5&lt;br /&gt;  term is&lt;br /&gt;    doc 0&lt;br /&gt;      pos 1&lt;br /&gt;  term second&lt;br /&gt;    doc 0&lt;br /&gt;      pos 3&lt;br /&gt;  term test&lt;br /&gt;    doc 0&lt;br /&gt;      pos 4&lt;br /&gt;  term the&lt;br /&gt;    doc 0&lt;br /&gt;      pos 2&lt;br /&gt;  term this&lt;br /&gt;    doc 0&lt;br /&gt;      pos 0&lt;br /&gt;END&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The codec is read/write, and fully functional.  All of Lucene's unit tests pass (slowly) with this codec (which, by the way, is an awesome way to test your own codecs).&lt;br /&gt;&lt;br /&gt;Note that the performance of SimpleText is quite poor, as expected! For example, there is no terms index for fast seeking to a specific term, no skipping data for fast seeking within a posting list, some operations require linear scanning, etc.  
So don't use this one in production!&lt;br /&gt;&lt;br /&gt;But it should be useful for transparency, debugging, learning, teaching or anyone who is simply just curious about what exactly Lucene stores in its inverted index.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5556522128874720413?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5556522128874720413/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/10/lucenes-simpletext-codec.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5556522128874720413'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5556522128874720413'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/10/lucenes-simpletext-codec.html' title='Lucene&apos;s SimpleText codec'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-4030087295420676823</id><published>2010-09-28T15:07:00.009-04:00</published><updated>2010-09-29T05:53:21.546-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Home automation'/><title type='text'>Track your home's live electricity usage with Python</title><content type='html'>Modern electric meters (the one attached to the outside of your house) emit an IR flash for each watt-hour 
consumed, through the port on the top of the meter.  It turns out, it's easy to detect this flash, decode it into "live" power usage, and make cool charts like this:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4pUbN9gxhUI/TKI9kH3HX_I/AAAAAAAAAB4/dEEWeYlh_cI/s1600/Elec.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 215px;" src="http://4.bp.blogspot.com/_4pUbN9gxhUI/TKI9kH3HX_I/AAAAAAAAAB4/dEEWeYlh_cI/s400/Elec.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5522043783904452594" /&gt;&lt;/a&gt;&lt;br /&gt;That's live KW on the Y axis and time on the X axis.&lt;br /&gt;&lt;br /&gt;The flash, at least for my meter, appears to have high temporal accuracy, meaning it flashes precisely as you cross 1.000 WH.  This is great, since it enables accurate, real-time power usage.  For example, when I turn on the lights in my home office, I quickly see the power jump by ~65 watts, and then drop again when I turn the lights off.&lt;br /&gt;&lt;br /&gt;This is a fun way to track down which appliances or crazy computers or parasitic drains are using up so much electricity in your house!&lt;br /&gt;&lt;br /&gt;I've gone through several designs for this over the years. My last attempt was destroyed when &lt;a href="http://chbits.blogspot.com/2010/06/our-house-was-hit-by-lightning.html"&gt;lightning hit my house&lt;/a&gt; (there's always a silver lining!), so, I decided to try something new this time and I'm very happy with the approach since it's much simpler, and, moves the pulse detection into Python.&lt;br /&gt;&lt;br /&gt;I use a trivial analog circuit to detect the IR pulse, consisting of two 100K resistors in series and an IR photodiode in parallel with one of the resistors.  Make sure you get the polarity right on the photodiode otherwise it won't work!  
Mount the photodiode on your power meter, aligned so that it "sees" each IR pulse.  I also added a small (0.01 uF) ceramic capacitor in parallel with the same resistor, to suppress transient EM fields otherwise picked up by the relatively long wires out to my meter.&lt;br /&gt;&lt;br /&gt;Finally, I use this &lt;a href="http://www.amazon.com/Syba-SD-CM-UAUD-Adapter-C-Media-Chipset/dp/B001MSS6CS/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1284509194&amp;sr=1-1"&gt;simple USB audio adapter&lt;/a&gt; to move the problem into the digital domain, connecting the ends of the two series resistors to the mic input, which I use for the A/D conversion.  This USB audio adapter drives the mic input with ~4.0V bias voltage, which is great since otherwise you'd need an external source.&lt;br /&gt;&lt;br /&gt;When there's no pulse, the photodiode acts roughly like an open switch, meaning there is a fixed 200K resistive load on the mic input. When a pulse happens (mine seem to last for ~10 msec), the photodiode acts roughly like a closed switch, suddenly dropping the series resistance to 100K.  This drop causes a very detectable pulse (strongly negative then strongly positive) on the mic input's voltage, I suspect because there's a capacitor behind the bias voltage (but: I am no analog circuit engineer, so this is speculation!).&lt;br /&gt;&lt;br /&gt;You can plug this USB audio adapter into any computer; I use a &lt;a href="http://en.wikipedia.org/wiki/SheevaPlug"&gt;Sheeva plug computer&lt;/a&gt; (delightful device, very low power -- I have three!).  I record the digital samples (&lt;a href="http://linux.die.net/man/1/arecord"&gt;arecord&lt;/a&gt; works well, at a 2 kHz rate) and decode the samples in Python to detect a pulse whenever the value drops below -1000.  
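&lt;br /&gt;&lt;br /&gt;Here is a minimal sketch of that decode step. It is not my actual script: the names &lt;tt&gt;detect_pulses&lt;/tt&gt; and &lt;tt&gt;live_kw&lt;/tt&gt; are made up for illustration, and it assumes the samples are already in memory as a list of signed integers.  Since each flash marks one watt-hour, a gap of N seconds between flashes works out to 3.6 / N kilowatts:&lt;br /&gt;
&lt;pre&gt;
```python
SAMPLE_RATE_HZ = 2000    # arecord rate mentioned above
PULSE_THRESHOLD = -1000  # a sample below this value counts as a pulse
WH_PER_PULSE = 1.0       # the meter flashes once per watt-hour


def detect_pulses(samples, threshold=PULSE_THRESHOLD):
    """Return the sample index at which each pulse first dips below threshold."""
    pulses = []
    in_pulse = False
    for i, value in enumerate(samples):
        below = threshold > value
        # Record only the leading edge, so a pulse spanning many
        # consecutive samples is counted once.
        if below and not in_pulse:
            pulses.append(i)
        in_pulse = below
    return pulses


def live_kw(pulse_indices, sample_rate_hz=SAMPLE_RATE_HZ):
    """Turn the gaps between adjacent pulses into kilowatt readings."""
    readings = []
    for prev, cur in zip(pulse_indices, pulse_indices[1:]):
        seconds = (cur - prev) / sample_rate_hz
        # 1 Wh delivered in `seconds` seconds is 3600 / seconds watts.
        readings.append(3.6 * WH_PER_PULSE / seconds)
    return readings
```
&lt;/pre&gt;&lt;br /&gt;With pulses two seconds apart this reports 1.8 kW.  A noisy signal that hovers around the threshold would be double-counted by this sketch, so a real decoder would probably want a minimum gap between pulses.&lt;br /&gt;&lt;br /&gt;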
You can easily compute the live KW based on the time between two adjacent pulses, push this into a database, and build graphs on top of this using &lt;a href="http://code.google.com/apis/visualization/documentation/gallery.html"&gt;Google's visualization APIs&lt;/a&gt; (I use &lt;a href="http://danvk.org/dygraphs"&gt;dygraphs&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;My last approach didn't have nearly the temporal accuracy (ie, it smoothed heavily across time), which masks interesting things. For example, now I can tell the difference between a resistive load (the coffee maker, oven, crockpot) and an inductive load (refrigerator compressor, vacuum cleaner) because the inductive load has a huge spike in the beginning, as the motor consumes lots of power trying to spin up.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-4030087295420676823?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/4030087295420676823/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/09/track-your-homes-live-electricity-usage.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4030087295420676823'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4030087295420676823'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/09/track-your-homes-live-electricity-usage.html' title='Track your home&apos;s live electricity usage with Python'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_4pUbN9gxhUI/TKI9kH3HX_I/AAAAAAAAAB4/dEEWeYlh_cI/s72-c/Elec.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-3229532251165863380</id><published>2010-09-17T05:35:00.009-04:00</published><updated>2010-10-25T16:55:25.600-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Lucene's indexing is fast!</title><content type='html'>&lt;div&gt;&lt;br /&gt;&lt;a href="http://wikipedia.org/"&gt;Wikipedia&lt;/a&gt; periodically exports all of the content on their site, providing a nice corpus for performance testing. I downloaded &lt;a href="http://download.wikimedia.org/enwiki/20100904/pages-articles.xml.bz2"&gt;their most recent English XML export&lt;/a&gt;: it uncompresses to a healthy 21 GB of plain text!  Then I fully indexed this with Lucene's current trunk (to be 4.0): it took 13 minutes and 9 seconds, or 95.8 GB/hour -- not bad!&lt;br /&gt;&lt;br /&gt;Here are the details: I first pre-process the XML file into a single-line file, whereby each doc's title, date, and body are written to a single line, and then index from this file, so that I measure "pure" indexing cost.  Note that a real app would likely have a higher document creation cost here, perhaps having to pull documents from a remote database or from separate files, run filters to extract text from PDFs or MS Office docs, etc.  
I use Lucene's &lt;tt&gt;contrib/benchmark&lt;/tt&gt; package to do the indexing; here's the alg I used:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer&lt;br /&gt;&lt;br /&gt;content.source = org.apache.lucene.benchmark.byTask.feeds.LineDocSource&lt;br /&gt;docs.file = /lucene/enwiki-20100904-pages-articles.txt&lt;br /&gt;&lt;br /&gt;doc.stored = true&lt;br /&gt;doc.term.vector = false&lt;br /&gt;doc.tokenized = false&lt;br /&gt;doc.body.stored = false&lt;br /&gt;doc.body.tokenized = true&lt;br /&gt;&lt;br /&gt;log.step.AddDoc=10000&lt;br /&gt;&lt;br /&gt;directory=FSDirectory&lt;br /&gt;compound=false&lt;br /&gt;ram.flush.mb = 256&lt;br /&gt;&lt;br /&gt;work.dir=/lucene/indices/enwiki&lt;br /&gt;&lt;br /&gt;content.source.forever = false&lt;br /&gt;&lt;br /&gt;CreateIndex&lt;br /&gt;&lt;br /&gt;{ "BuildIndex"&lt;br /&gt;[ { "AddDocs" AddDoc &gt; : * ] : 6&lt;br /&gt;- CloseIndex&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;RepSumByPrefRound BuildIndex&lt;/pre&gt;&lt;br /&gt;There is no field truncation taking place, since this is now disabled by default -- every token in every Wikipedia article is being indexed. I tokenize the body field, and don't store it, and don't tokenize the title and date fields, but do store them.  I use StandardAnalyzer, and I include the time to close the index, which means IndexWriter waits for any running background merges to complete. 
The index only has 4 fields -- title, date, body, and docid.&lt;br /&gt;&lt;br /&gt;I've done a few things to speed up the indexing:&lt;ul&gt;&lt;li&gt; Increase IndexWriter's RAM buffer from the default 16 MB to 256 MB&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Run with 6 threads&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Disable compound file format&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Reuse document/field instances (&lt;tt&gt;contrib/benchmark&lt;/tt&gt; does this by default)&lt;/li&gt;&lt;/ul&gt;Lucene's wiki &lt;a href="http://wiki.apache.org/lucene-java/ImproveIndexingSpeed"&gt;describes&lt;/a&gt; additional steps you can take to speed up indexing.&lt;br /&gt;&lt;br /&gt;Both the source lines file and index are on an Intel X25-M SSD, and I'm running it on a modern machine, with dual Xeon X5680s, overclocked to 4.0 Ghz, with 12 GB RAM, running Fedora Linux.  Java is &lt;tt&gt;64bit 1.6.0_21-b06&lt;/tt&gt;, and I run as &lt;tt&gt;java -server -Xmx2g -Xms2g&lt;/tt&gt;.  I could certainly give it more RAM, but it's not really needed.  The resulting index is 6.9 GB.&lt;br /&gt;&lt;br /&gt;Out of curiosity, I made a small change to &lt;tt&gt;contrib/benchmark&lt;/tt&gt;, to print the ingest rate over time. It looks like this (over a 100-second window):&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;a border=0 onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4pUbN9gxhUI/TJM9bDK2zaI/AAAAAAAAABw/vClPjJBoYYs/s1600/IngestRate.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 240px;" src="http://4.bp.blogspot.com/_4pUbN9gxhUI/TJM9bDK2zaI/AAAAAAAAABw/vClPjJBoYYs/s400/IngestRate.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5517821503375592866" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Note that a large part (slightly over half!) of the time, the ingest rate is 0; this is not good!  
This happens because the flushing process, which writes a new segment when the RAM buffer is full, is single-threaded, and, blocks all indexing while it's running.  This is a known issue, and is actively being addressed under &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2324"&gt;LUCENE-2324&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Flushing is CPU intensive -- the decode and reencode of the great many vInts is costly.  Computers usually have big write caches these days, so the IO shouldn't be a bottleneck.  With &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2324"&gt;LUCENE-2324&lt;/a&gt;, each indexing thread state will flush its own segment, privately, which will allow us to make full use of CPU concurrency, IO concurrency as well as concurrency across CPUs and the IO system. Once this is fixed, Lucene should be able to make full use of the hardware, ie fully saturate either concurrent CPU or concurrent IO such that whichever is the bottleneck in your context gates your ingest rate.  
Then maybe we can double this already fast ingest rate!&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-3229532251165863380?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/3229532251165863380/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/09/lucenes-indexing-is-fast.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3229532251165863380'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3229532251165863380'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/09/lucenes-indexing-is-fast.html' title='Lucene&apos;s indexing is fast!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_4pUbN9gxhUI/TJM9bDK2zaI/AAAAAAAAABw/vClPjJBoYYs/s72-c/IngestRate.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7937320804872630057</id><published>2010-09-16T10:29:00.002-04:00</published><updated>2010-09-16T10:32:45.274-04:00</updated><title type='text'>Proper localization is important!</title><content type='html'>Sometimes it's &lt;a href="http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail"&gt;really important&lt;/a&gt; to get 
localization right!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7937320804872630057?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7937320804872630057/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/09/proper-localization-is-important.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7937320804872630057'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7937320804872630057'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/09/proper-localization-is-important.html' title='Proper localization is important!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-1591711536975387236</id><published>2010-09-13T14:11:00.005-04:00</published><updated>2010-09-13T14:52:20.853-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Fast search filters using flex</title><content type='html'>A filter in Lucene is a bit set that restricts the search space for any query; you pass it into IndexSearcher's search method.  
It's effective for a number of use cases, such as document security, index partitions, facet drill-down, etc.&lt;br /&gt;&lt;br /&gt;To apply a filter, Lucene must compute the intersection of the documents matching the query against the documents allowed by the filter.  Today, we do that &lt;a href="https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/search/IndexSearcher.java"&gt;in IndexSearcher&lt;/a&gt; like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  while (true) {&lt;br /&gt;    if (scorerDoc == filterDoc) {&lt;br /&gt;      // Check if scorer has exhausted, only before collecting.&lt;br /&gt;      if (scorerDoc == DocIdSetIterator.NO_MORE_DOCS) {&lt;br /&gt;        break;&lt;br /&gt;      }&lt;br /&gt;      collector.collect(scorerDoc);&lt;br /&gt;      filterDoc = filterIter.nextDoc();&lt;br /&gt;      scorerDoc = scorer.advance(filterDoc);&lt;br /&gt;    } else if (scorerDoc &gt; filterDoc) {&lt;br /&gt;      filterDoc = filterIter.advance(scorerDoc);&lt;br /&gt;    } else {&lt;br /&gt;      scorerDoc = scorer.advance(filterDoc);&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;We call this the "leapfrog approach": the query and the filter take turns trying to advance to each other's next matching document, often jumping past the target document.  
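&lt;br /&gt;&lt;br /&gt;To make the leapfrog concrete, here is a rough Python sketch of the same loop over two sorted lists of document IDs.  This is an illustration only, not Lucene code: &lt;tt&gt;advance&lt;/tt&gt; and &lt;tt&gt;leapfrog_intersect&lt;/tt&gt; are names invented for this sketch, a binary search stands in for a real iterator's advance, and document IDs are assumed to be non-negative integers:&lt;br /&gt;
&lt;pre&gt;
```python
from bisect import bisect_left

# Sentinel playing the role of DocIdSetIterator.NO_MORE_DOCS.
NO_MORE_DOCS = float("inf")


def advance(doc_ids, target):
    """Return the first doc ID at or beyond target, or NO_MORE_DOCS."""
    i = bisect_left(doc_ids, target)
    return doc_ids[i] if len(doc_ids) > i else NO_MORE_DOCS


def leapfrog_intersect(query_docs, filter_docs):
    """Collect the doc IDs present in both sorted lists."""
    collected = []
    scorer_doc = advance(query_docs, 0)
    filter_doc = advance(filter_docs, 0)
    while True:
        if scorer_doc == filter_doc:
            # Check for exhaustion only before collecting.
            if scorer_doc == NO_MORE_DOCS:
                break
            collected.append(scorer_doc)
            filter_doc = advance(filter_docs, filter_doc + 1)  # like nextDoc()
            scorer_doc = advance(query_docs, filter_doc)
        elif scorer_doc > filter_doc:
            filter_doc = advance(filter_docs, scorer_doc)
        else:
            scorer_doc = advance(query_docs, filter_doc)
    return collected
```
&lt;/pre&gt;&lt;br /&gt;For example, &lt;tt&gt;leapfrog_intersect([1, 3, 5, 8], [3, 4, 8, 9])&lt;/tt&gt; returns &lt;tt&gt;[3, 8]&lt;/tt&gt;.&lt;br /&gt;&lt;br /&gt;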
When both land on the same document, it's collected.&lt;br /&gt;&lt;br /&gt;Unfortunately, for various reasons this implementation is inefficient (these are spelled out more in &lt;a href="https://issues.apache.org/jira/browse/LUCENE-1536"&gt;LUCENE-1536&lt;/a&gt;):&lt;ol&gt;&lt;li&gt; The advance method for most queries is costly.&lt;br /&gt;&lt;/li&gt;&lt;li&gt; The advance method for most filters is usually cheap.&lt;br /&gt;&lt;/li&gt;&lt;li&gt; If the number of documents matching the query is far higher than&lt;br /&gt;the number matching the filter, or vice versa, it's better to drive&lt;br /&gt;the matching by whichever is more restrictive.&lt;br /&gt;&lt;/li&gt;&lt;li&gt; If the filter supports fast random access, and is not super&lt;br /&gt;sparse, it's better to apply it during postings enumeration, like&lt;br /&gt;deleted docs.&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Query scorers don't have a random access API, only .advance(),&lt;br /&gt;which does unnecessary extra work .next()'ing to the next matching&lt;br /&gt;document.&lt;/li&gt;&lt;/ol&gt;To fix this correctly, Lucene really needs an optimization stage, much like a modern database, which looks at the type and structure of the query and filter, as well as estimates of how many documents will match, and then picks an appropriate execution path.  Likely this will be coupled with source code specialization/generation (&lt;a href="https://issues.apache.org/jira/browse/LUCENE-1594"&gt;LUCENE-1594&lt;/a&gt;) to write dedicated Java code to execute the chosen path. Some day we'll get to that point!&lt;br /&gt;&lt;br /&gt;Until then, there is a simple way to get a large speedup in many cases, addressing the 4th issue above.  Prior to flexible indexing, when you obtained the postings enumeration for documents matching a given term, Lucene would silently filter out deleted documents.  With flexible indexing, the API now allows you to pass in a bit set marking the documents to skip.  
Normally you'd pass in the IndexReader's deleted docs.  But, with a simple subclass of FilterIndexReader, it's possible to use any filter as the documents to skip.&lt;br /&gt;&lt;br /&gt;To test this, I created a simple class, CachedFilterIndexReader (I'll attach it to &lt;a href="https://issues.apache.org/jira/browse/LUCENE-1536"&gt;LUCENE-1536&lt;/a&gt;). You pass it an existing IndexReader, plus a Filter, and it creates an IndexReader that filters out both deleted documents and documents that don't match the provided filter.  Basically, it compiles the IndexReader's deleted docs (if any), and the negation of the incoming filter, into a single cached bit set, and then passes that bit set as the skipDocs whenever postings are requested.  You can then create an IndexSearcher from this reader, and all searches against it will be filtered according to the filter you provided.&lt;br /&gt;&lt;br /&gt;This is just a prototype, and has certain limitations, eg it doesn't implement reopen, it's slow to build up its cached filter, etc.&lt;br /&gt;&lt;br /&gt;Still, it works very well!  
I tested it on a 10M Wikipedia index, with a random filter accepting 50% of the documents:&lt;br /&gt;&lt;br /&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align="left"&gt;Query&lt;/td&gt;&lt;td align="left"&gt;QPS Default&lt;/td&gt;&lt;td align="left"&gt;QPS Flex&lt;/td&gt;&lt;td&gt;% change&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;united~0.7&lt;/td&gt;&lt;td&gt;19.95&lt;/td&gt;&lt;td&gt;19.25&lt;/td&gt;&lt;td&gt;&lt;span style="color:red;"&gt;-3.5%&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;un*d&lt;/td&gt;&lt;td&gt;43.19&lt;/td&gt;&lt;td&gt;52.21&lt;/td&gt;&lt;td&gt;&lt;span style="color:green;"&gt;20.9%&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;unit*&lt;/td&gt;&lt;td&gt;21.53&lt;/td&gt;&lt;td&gt;30.52&lt;/td&gt;&lt;td&gt;&lt;span style="color:green;"&gt;41.8%&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;"united states"&lt;/td&gt;&lt;td&gt;6.12&lt;/td&gt;&lt;td&gt;8.74&lt;/td&gt;&lt;td&gt;&lt;span style="color:green;"&gt;42.9%&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;+united +states&lt;/td&gt;&lt;td&gt;9.68&lt;/td&gt;&lt;td&gt;14.23&lt;/td&gt;&lt;td&gt;&lt;span style="color:green;"&gt;47.0%&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;united states&lt;/td&gt;&lt;td&gt;7.71&lt;/td&gt;&lt;td&gt;14.56&lt;/td&gt;&lt;td&gt;&lt;span style="color:green;"&gt;88.9%&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;states&lt;/td&gt;&lt;td&gt;15.73&lt;/td&gt;&lt;td&gt;36.05&lt;/td&gt;&lt;td&gt;&lt;span style="color:green;"&gt;129.2%&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;I'm not sure why the fuzzy query got a bit slower, but the speedups on the other queries are awesome.  However, this approach is actually slower if the filter is very sparse.  
To test this, I ran just the TermQuery ("states"), against different filter densities:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4pUbN9gxhUI/TI5r3_yhKaI/AAAAAAAAABo/A2Vwa0olPxg/s1600/TermQuerySpeedup.png"&gt;&lt;img border=0 style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 240px;" src="http://1.bp.blogspot.com/_4pUbN9gxhUI/TI5r3_yhKaI/AAAAAAAAABo/A2Vwa0olPxg/s400/TermQuerySpeedup.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5516465203335735714" /&gt;&lt;/a&gt;&lt;br /&gt;The cutover, for TermQuery at least, is somewhere around 1.1%, meaning if the filter accepts more than 1.1% of the index, it's best to use the CachedFilterIndexReader class; otherwise it's best to use Lucene's current implementation.&lt;br /&gt;&lt;br /&gt;Thanks to this new flex API, until we can fix Lucene to properly optimize for filter and query intersection, this class gives you a viable, fully external, means of massive speedups for non-sparse filters!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-1591711536975387236?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/1591711536975387236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/09/fast-search-filters-using-flex.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/1591711536975387236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/1591711536975387236'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/09/fast-search-filters-using-flex.html' 
title='Fast search filters using flex'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_4pUbN9gxhUI/TI5r3_yhKaI/AAAAAAAAABo/A2Vwa0olPxg/s72-c/TermQuerySpeedup.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-6273816559664759310</id><published>2010-08-24T08:11:00.002-04:00</published><updated>2010-08-24T08:17:59.206-04:00</updated><title type='text'>Here comes another bubble</title><content type='html'>&lt;a href="http://www.youtube.com/watch?v=I6IQ_FOCE6I"&gt;Hilarious&lt;/a&gt;.  It's a spoof of Billy Joel's catchy &lt;a href="http://en.wikipedia.org/wiki/We_Didn't_Start_the_Fire"&gt;We Didn't Start the Fire&lt;/a&gt;.  
Scary that this was posted more than two and a half years ago!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-6273816559664759310?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/6273816559664759310/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/08/here-comes-another-bubble.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6273816559664759310'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6273816559664759310'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/08/here-comes-another-bubble.html' title='Here comes another bubble'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5495240611972763170</id><published>2010-08-07T08:22:00.000-04:00</published><updated>2010-08-07T08:23:14.913-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Kids'/><title type='text'>Do babies really sleep?</title><content type='html'>Those of you with kids have undoubtedly discovered that sometimes they sleep with their eyes half open.&lt;br /&gt;&lt;br /&gt;The first time it happens it's somewhat alarming, but, then, you get used to it and realize it's "normal".  
It probably just means their eyelids haven't quite closed yet.&lt;br /&gt;&lt;br /&gt;Well, my one-year-old did this, yesterday, so I seized the opportunity to do an experiment!!  I wanted to test just how asleep she was, so, I smiled like crazy, made funny faces, scrunched my nose, stuck my tongue out, at her half open eye.&lt;br /&gt;&lt;br /&gt;And get this: she laughed!  In her sleep.  I repeated the test 3 times and each time she laughed.  Yet, she was most definitely "asleep" -- the eyelid eventually closed and she didn't wake up for another hour.&lt;br /&gt;&lt;br /&gt;I guess enough of the brain remains active to process what the eye is seeing even when the rest of the brain is sleeping?  Wild!&lt;br /&gt;&lt;br /&gt;Before you think I'm being unfair, she also conducts sleep experiments on us!  Last weekend the whole family went camping.  It was bed time and, as usual, we were all exhausted except our one-year-old.  There was enough low light so she could see, so, as we were all dropping off one by one, she took turns running back and forth and pulling hard on our chins to wake us up again!  
Probably, from her perspective, this was like a horror movie, where her family is one by one dropping into comas and it's her job to prevent the disaster!&lt;br /&gt;&lt;br /&gt;Kids are great fun!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5495240611972763170?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5495240611972763170/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/08/do-babies-really-sleep.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5495240611972763170'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5495240611972763170'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/08/do-babies-really-sleep.html' title='Do babies really sleep?'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-4407921155446536616</id><published>2010-08-04T10:27:00.002-04:00</published><updated>2010-08-04T10:30:31.156-04:00</updated><title type='text'>Java developers are not consenting adults</title><content type='html'>&lt;a href="http://steve-yegge.blogspot.com/2010/07/wikileaks-to-leak-5000-open-source-java.html"&gt;Hilarious&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/8623074010562846957-4407921155446536616?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/4407921155446536616/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/08/java-developers-are-not-consenting.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4407921155446536616'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4407921155446536616'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/08/java-developers-are-not-consenting.html' title='Java developers are not consenting adults'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-6298478417350691637</id><published>2010-08-02T18:12:00.004-04:00</published><updated>2010-08-02T20:40:50.508-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Lucene performance with the PForDelta codec</title><content type='html'>Today, to encode the postings (docs, freqs, positions) in the index, Lucene uses a variable byte format where each integer is individually encoded as 1-5 bytes.&lt;br /&gt;&lt;br /&gt;While this is wonderfully simple, it requires an if statement on every byte during decode, which is very costly since the CPU cannot easily &lt;a href="http://en.wikipedia.org/wiki/Branch_predictor"&gt;predict 
the branch outcome&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;To reduce the number of branches per int decode, you must switch to decoding groups of ints at once.  &lt;a href="http://www2008.org/papers/pdf/p387-zhangA.pdf"&gt;This paper&lt;/a&gt; describes a few such encoders (PForDelta, Rice Coding, Simple9, Simple16).  &lt;a href=""&gt;Chapter 6 (Index Compression)&lt;/a&gt; of &lt;a href="http://www.ir.uwaterloo.ca/book"&gt;Information Retrieval: Implementing and Evaluating Search Engines&lt;/a&gt; goes into great detail on these and others, including an &lt;a href="http://www.ir.uwaterloo.ca/book/addenda-06-index-compression.html"&gt;addendum&lt;/a&gt; showing results for &lt;a href="http://research.google.com/people/jeff/WSDM09-keynote.pdf"&gt;Google's Group VarInt&lt;/a&gt; encoder.&lt;br /&gt;&lt;br /&gt;The IntBlock codec is designed to make it easy to experiment with these sorts of different int encoders.  It does the "hard part" of enabling Lucene to operate on a set of ints at once, which is somewhat tricky as the seek points (in the terms dict and the skipping lists) must now encode both the file-pointer where the block starts and the index (starting int) into the block.&lt;br /&gt;&lt;br /&gt;There is already a low-level initial implementation of frame-of-reference (FOR) and patched frame-of-reference (PFOR), on &lt;a href="https://issues.apache.org/jira/browse/LUCENE-1410"&gt;LUCENE-1410&lt;/a&gt;, as well as Simple9/16 on &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2189"&gt;LUCENE-2189&lt;/a&gt;, thanks to Paul Elschot and Renaud Delbru.&lt;br /&gt;&lt;br /&gt;FOR is a simple encoding: it takes each fixed block of ints, and stores them all as packed ints, where each value gets N bits, set by the maximum int in the block.  So say our block size is 8, and we have these ints:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  1 7 3 5 6 2 2 5&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;we'd only need 3 bits per value.  
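&lt;br /&gt;&lt;br /&gt;Here's a minimal Python sketch of that bit-width calculation and packing.  It is not the code from the LUCENE-1410 patch: the function names are made up for illustration, and a single Python int stands in for the packed bytes an actual codec would write:&lt;br /&gt;&lt;pre&gt;
```python
def bits_required(block):
    """Bits per value needed so every int in the block fits."""
    # max(1, ...) so an all-zero block still gets one bit per value.
    return max(1, max(v.bit_length() for v in block))


def pack_block(block, num_bits):
    """Pack the ints back to back into one big int, first value lowest."""
    packed = 0
    for i, value in enumerate(block):
        # Multiplying by 2 ** (i * num_bits) is a left shift by i * num_bits.
        packed += value * 2 ** (i * num_bits)
    return packed


def unpack_block(packed, num_bits, count):
    """Recover `count` ints from a value produced by pack_block."""
    base = 2 ** num_bits
    values = []
    for _ in range(count):
        # The remainder is the lowest num_bits bits; the quotient is the rest.
        packed, value = divmod(packed, base)
        values.append(value)
    return values


block = [1, 7, 3, 5, 6, 2, 2, 5]
num_bits = bits_required(block)
packed = pack_block(block, num_bits)
print(num_bits, unpack_block(packed, num_bits, len(block)))
# prints: 3 [1, 7, 3, 5, 6, 2, 2, 5]
```
&lt;/pre&gt;&lt;br /&gt;With 8 values at 3 bits each, the whole block fits in 24 bits (3 bytes), and decoding runs over the block without a per-byte branch on a continuation bit.&lt;br /&gt;&lt;br /&gt;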
But, FOR has a clear weakness: if you have a single big int mixed in with a bunch of small ones, then you waste bits.  For example if we have these ints:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  1 7 3 5 293 2 2 5&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;then FOR must use 9 bits for all values (because of 293), despite the fact that the other values only needed 3 bits each.  How much this hurts in practice remains to be seen.&lt;br /&gt;&lt;br /&gt;PFOR fixes this by marking such large numbers as exceptions, which are then "patched" after the initial decode.  So for the above example, we would still use 3 bits per-value, but we'd mark that the 293 value was an "exception" and it would be separately encoded at the end (with more bits), and then "patched" back in at decode time.  &lt;a href="http://www2008.org/papers/pdf/p387-zhangA.pdf"&gt;The above paper&lt;/a&gt; has the details.&lt;br /&gt;&lt;br /&gt;Note that block-based codecs are not always a good fit, since in order to access even one int within the block, they [typically] must decode the full block.  This can be a high cost if you frequently need to decode just an int or two, such as with a primary key field.  &lt;a href="http://chbits.blogspot.com/2010/06/lucenes-pulsingcodec-on-primary-key.html"&gt;Pulsing codec&lt;/a&gt; is a better fit for such fields.&lt;br /&gt;&lt;br /&gt;During testing, I found a few silly performance problems in the IntBlock codec, which I've hacked around for now (I'll post my patch on &lt;a href="https://issues.apache.org/jira/browse/LUCENE-1410"&gt;LUCENE-1410&lt;/a&gt;); the hacks are nowhere near committable, but are functioning correctly (identical search results for the queries I'm testing).  I also made some more minor optimizations to the FOR/PFOR implementation.  
I'd like to get the IntBlock codec to the point where anyone can easily tie in a fixed or variable block size int encoding algorithm and run their own tests.&lt;br /&gt;&lt;br /&gt;I indexed the first 10 million Wikipedia docs, each ~1KB in size.  The resulting index sizes were:&lt;br /&gt;&lt;br /&gt;&lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;b&gt;Codec&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;Size&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;% Change&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Standard&lt;/td&gt;&lt;td&gt;3.50 GB&lt;/td&gt;&lt;td&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FOR&lt;/td&gt;&lt;td&gt;3.96 GB&lt;/td&gt;&lt;td&gt;13.3% bigger&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PFOR&lt;/td&gt;&lt;td&gt;3.87 GB&lt;/td&gt;&lt;td&gt;11.1% bigger&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br /&gt;The size increases for both FOR and PFOR are fairly low.  PFOR is smaller than FOR, as expected, but it's surprising that it's only a little bit smaller.  Though, this is a function of where you draw the line for the exception cases and will be corpus dependent.  Still, I'm encouraged by how little additional space FOR requires over the Standard codec.&lt;br /&gt;&lt;br /&gt;I ran a set of queries and measured the best time (of 40 runs) for each, across 4 threads.  
Here are the results for FOR:&lt;br /&gt;&lt;br /&gt;&lt;table&gt;&lt;tr&gt;&lt;td align=left&gt;&lt;b&gt;Query&lt;/b&gt;&lt;/td&gt;&lt;td align=left&gt;&lt;b&gt;QPS Standard&lt;/b&gt;&lt;/td&gt;&lt;td align=left&gt;&lt;b&gt;QPS FOR&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;% Change&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;united~0.6&lt;/td&gt;&lt;td&gt;5.59&lt;/td&gt;&lt;td&gt;5.08&lt;/td&gt;&lt;td&gt;&lt;font color=red&gt;-9.0%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;"united states"&lt;/td&gt;&lt;td&gt;11.46&lt;/td&gt;&lt;td&gt;11.22&lt;/td&gt;&lt;td&gt;&lt;font color=red&gt;-2.1%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;united~0.7&lt;/td&gt;&lt;td&gt;18.33&lt;/td&gt;&lt;td&gt;19.23&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;4.9%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;united states&lt;/td&gt;&lt;td&gt;15.53&lt;/td&gt;&lt;td&gt;18.66&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;20.1%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;uni*d&lt;/td&gt;&lt;td&gt;34.25&lt;/td&gt;&lt;td&gt;43.37&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;26.6%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;unit*&lt;/td&gt;&lt;td&gt;31.27&lt;/td&gt;&lt;td&gt;41.07&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;31.3%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;states&lt;/td&gt;&lt;td&gt;55.72&lt;/td&gt;&lt;td&gt;75.82&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;36.1%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;+united +states&lt;/td&gt;&lt;td&gt;15.17&lt;/td&gt;&lt;td&gt;21.43&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;41.2%&lt;/font&gt;&lt;/td&gt;&lt;/table&gt;&lt;br /&gt;and for PFOR:&lt;br /&gt;&lt;br /&gt;&lt;table&gt;&lt;tr&gt;&lt;td align=left&gt;&lt;b&gt;Query&lt;/b&gt;&lt;/td&gt;&lt;td align=left&gt;&lt;b&gt;QPS Standard&lt;/b&gt;&lt;/td&gt;&lt;td align=left&gt;&lt;b&gt;QPS PFOR&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;% Change&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;united~0.6&lt;/td&gt;&lt;td&gt;5.59&lt;/td&gt;&lt;td&gt;5.02&lt;/td&gt;&lt;td&gt;&lt;font 
color=red&gt;-10.1%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;"united states"&lt;/td&gt;&lt;td&gt;11.46&lt;/td&gt;&lt;td&gt;10.88&lt;/td&gt;&lt;td&gt;&lt;font color=red&gt;-5.1%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;united~0.7&lt;/td&gt;&lt;td&gt;18.33&lt;/td&gt;&lt;td&gt;19.00&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;3.7%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;united states&lt;/td&gt;&lt;td&gt;15.53&lt;/td&gt;&lt;td&gt;18.11&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;16.6%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;uni*d&lt;/td&gt;&lt;td&gt;34.25&lt;/td&gt;&lt;td&gt;43.71&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;27.6%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;states&lt;/td&gt;&lt;td&gt;55.72&lt;/td&gt;&lt;td&gt;71.39&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;28.1%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;unit*&lt;/td&gt;&lt;td&gt;31.27&lt;/td&gt;&lt;td&gt;41.42&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;32.5%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;+united +states&lt;/td&gt;&lt;td&gt;15.17&lt;/td&gt;&lt;td&gt;21.25&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;40.1%&lt;/font&gt;&lt;/td&gt;&lt;/table&gt;&lt;br /&gt;They both show good gains for most queries; FOR has a slight edge since it doesn't have to decode exceptions.  The united~0.6 (max edit distance = 2) is slower, likely because a good number of the 519 terms it expands to have low frequency, and so the overhead of a full block decode is hurting performance.  The PhraseQuery also got a little slower, probably because it uses non-block skipping during searching.  The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints.  
Likely a hybrid codec for Lucene, using FOR or PFOR for high frequency terms, Standard for medium frequency, and Pulsing for very low frequency, would perform best overall.&lt;br /&gt;&lt;br /&gt;I ran one additional test, this time using FOR on MMapDirectory (instead of NIOFSDirectory used above), passing an IntBuffer from the mapped pages directly to the FOR decoder:&lt;br /&gt;&lt;br /&gt;&lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;b&gt;Query&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;QPS Standard&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;QPS FOR&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;% Change&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;united~0.6&lt;/td&gt;&lt;td&gt;6.00&lt;/td&gt;&lt;td&gt;5.28&lt;/td&gt;&lt;td&gt;&lt;font color=red&gt;-12.1%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;united~0.7&lt;/td&gt;&lt;td&gt;18.48&lt;/td&gt;&lt;td&gt;20.53&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;11.1%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;"united states"&lt;/td&gt;&lt;td&gt;10.47&lt;/td&gt;&lt;td&gt;11.70&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;11.8%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;united states&lt;/td&gt;&lt;td&gt;15.21&lt;/td&gt;&lt;td&gt;20.25&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;33.2%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;uni*d&lt;/td&gt;&lt;td&gt;33.16&lt;/td&gt;&lt;td&gt;47.59&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;43.5%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;unit*&lt;/td&gt;&lt;td&gt;29.71&lt;/td&gt;&lt;td&gt;45.14&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;51.9%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;+united +states&lt;/td&gt;&lt;td&gt;15.14&lt;/td&gt;&lt;td&gt;23.65&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;56.2%&lt;/font&gt;&lt;/td&gt;&lt;tr&gt;&lt;td&gt;states&lt;/td&gt;&lt;td&gt;52.30&lt;/td&gt;&lt;td&gt;88.66&lt;/td&gt;&lt;td&gt;&lt;font color=green&gt;69.5%&lt;/font&gt;&lt;/td&gt;&lt;/table&gt;&lt;br /&gt;The results are generally even better for two reasons.  
First, for some reason the Standard codec slows down a bit with MMapDirectory.  Second, the FOR codec speeds up substantially, because I'm able (in a hacked-up way, though) to do a no-copy decode with MMapDirectory.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Remember&lt;/b&gt; that these results are very preliminary, based on a patch with many non-committable hacks (eg deleted docs are ignored!) that doesn't pass all of Lucene's unit tests, etc.  There are also still further optimizations to explore, and lots of work remains.  So it's not clear where the final numbers will be, but these initial results are very encouraging!&lt;br /&gt;&lt;br /&gt;If you have any other algorithms to try for encoding ints, please pipe up!  It's very easy to have Lucene generate a few billion ints for you to encode/decode if you want to run standalone tests.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-6298478417350691637?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/6298478417350691637/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6298478417350691637'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6298478417350691637'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html' title='Lucene performance with the PForDelta codec'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-1380306987447631838</id><published>2010-07-14T05:17:00.004-04:00</published><updated>2010-07-14T05:44:55.058-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Moving readVInt to C</title><content type='html'>By far the hottest spot in Lucene during searching is the method (DataInput.readVInt) that decodes Lucene's variable-length integer representation (vInt).  This method is called an insane number of times, while iterating the postings lists (docs, freqs, positions), during query execution.&lt;br /&gt;&lt;br /&gt;This representation is not CPU friendly: for every single byte read, the CPU hits a hard-to-predict if statement (testing the high bit).  Alternative block-based formats do a good job reducing branches while decoding, such as &lt;a href="http://cis.poly.edu/cs912/indexcomp.pdf"&gt;these encodings&lt;/a&gt; (PFOR-DELTA is currently an experimental codec attached as a patch on &lt;a href="https://issues.apache.org/jira/browse/LUCENE-1410"&gt;LUCENE-1410&lt;/a&gt;) or Google's &lt;a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf"&gt;Group Varint&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Anyway, I decided to test whether porting readVInt to C would give us a performance boost.  I had known the overhead of JNI was highish in the past, but I was hoping in modern JREs this had been improved.&lt;br /&gt;&lt;br /&gt;Unfortunately, it wasn't: I see a ~10%-~40% slowdown, depending on the query.  Likely HotSpot's inability to inline native methods is also a factor here.  
Perhaps, if/when we switch to a block-based codec, we'll see a gain with a native implementation, since the amortized overhead of invoking JNI will be lower.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-1380306987447631838?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/1380306987447631838/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/moving-readvint-to-c.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/1380306987447631838'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/1380306987447631838'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/moving-readvint-to-c.html' title='Moving readVInt to C'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-8830706149156202469</id><published>2010-07-13T16:56:00.004-04:00</published><updated>2010-07-13T17:25:37.344-04:00</updated><title type='text'>Apple censorship, again</title><content type='html'>It's sickening that Apple thinks it's OK to &lt;a href="http://www.tuaw.com/2010/07/12/apple-drops-consumer-reports-discussion-threads-down-memory-hole/"&gt;censor (remove) posts from their forums&lt;/a&gt;.  
It's &lt;a href="http://www.tomshardware.com/reviews/apple-display-update,1747.html"&gt;not the first time they've done this&lt;/a&gt;, either.&lt;br /&gt;&lt;br /&gt;I consider it entrapment -- Apple hosts these forums, as the obvious place where we all can go to discuss any issues we have.  As with any forum, there's a natural implicit assumption of non-interference, much as I expect my ISP not to block my visits to random websites, my email provider not to block certain emails, and the local coffee shop not to disallow discussions about certain topics.&lt;br /&gt;&lt;br /&gt;But then, suddenly, Apple acts like China: they censor that which they disagree with.&lt;br /&gt;&lt;br /&gt;C'mon Apple!&lt;br /&gt;&lt;br /&gt;You should be above such draconian behavior.  If you disagree with what's being said, play fair!  Come out and discuss the topic, explain your viewpoint, convince us.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-8830706149156202469?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/8830706149156202469/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/apple-censorship-again.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8830706149156202469'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8830706149156202469'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/apple-censorship-again.html' title='Apple censorship, again'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-8290755928750361450</id><published>2010-07-11T08:17:00.002-04:00</published><updated>2010-10-25T16:55:41.105-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Health'/><title type='text'>Beware statins</title><content type='html'>&lt;a href="http://www.thincs.org/melchior1.htm"&gt;Statins -- Miracle drug or tragedy?&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-8290755928750361450?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/8290755928750361450/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/beware-statins.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8290755928750361450'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8290755928750361450'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/beware-statins.html' title='Beware statins'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-2242460177574274056</id><published>2010-07-10T11:34:00.002-04:00</published><updated>2010-07-10T11:40:23.505-04:00</updated><title type='text'>What motivates us</title><content type='html'>&lt;a href="http://www.youtube.com/watch?v=u6XAPnuFjJc"&gt;Great video&lt;/a&gt;, from Daniel Pink, animated by &lt;a href="http://www.thersa.org"&gt;RSA&lt;/a&gt;, summarizing &lt;a href="http://www.amazon.com/Drive-Surprising-Truth-About-Motivates/dp/1594488843"&gt;his book&lt;/a&gt; digging into what motivates people to do things.  I find it very accurate about myself, and I expect especially people working in open-source projects will resonate strongly with the surprising findings.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-2242460177574274056?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/2242460177574274056/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/what-motivates-us.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2242460177574274056'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2242460177574274056'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/what-motivates-us.html' title='What motivates us'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7667758208244272763</id><published>2010-07-10T11:18:00.002-04:00</published><updated>2010-07-10T11:21:21.784-04:00</updated><title type='text'>Encoding videos for Apple iPad/iPod, using mencoder</title><content type='html'>After much searching, poking, and iterating, I found a magical mencoder command-line that encodes videos in the right format for the &lt;a href="http://www.apple.com/ipad"&gt;iPad&lt;/a&gt;.  Debugging this is hard because, if you get something wrong, iTunes simply silently fails to open the video file, giving no specific feedback as to what's wrong. I felt like a monkey on a typewriter...&lt;br /&gt;&lt;br /&gt;So here are the options, broken down by group.  Set the output format to MP4:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;   -of lavf -lavfopts format=mp4&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Scale the video to width 1280, height to match the aspect ratio (if your source video is low resolution, eg DVD, leave this off; you can also crop (-vf crop=W:H:X:Y; run mplayer -vf cropdetect,scale to have it guess the crop for you), or de-interlace (I like -vf pp=lb), or add any other video filter):&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  -vf scale=1280:-3&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Use x264 codec for video, with a zillion options (tweak that crf=XX if you want higher quality; lower values of XX give higher quality, but take more space):&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  -ovc x264 -x264encopts crf=28:vbv_maxrate=1500:nocabac:global_header:frameref=3:threads=auto:bframes=0:subq=6:mixed-refs=0:weightb=0:8x8dct=1:me=umh:partitions=all:qp_step=4:qcomp=0.7:trellis=1:direct_pred=auto&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Use faac codec for audio:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  -oac faac -faacopts br=160:mpeg=4:object=2:raw -channels 2 -srate 
48000&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I'm sure many other command-line combinations would work; this was just the first combo I found that works.&lt;br /&gt;&lt;br /&gt;If you need to encode for the &lt;a href="http://www.apple.com/ipodtouch/"&gt;iPod Touch&lt;/a&gt;, just change the -vf scale=1280:-3 to -vf scale=640:-3 (at least this works for recent iPod Touch models).&lt;br /&gt;&lt;br /&gt;You can also use the nifty &lt;a href="http://mp4v2.googlecode.com/svn/doc/1.9.0/ToolGuide.html"&gt;mp4art&lt;/a&gt; tool to insert cover art after the encode finishes; this lets you set the image for the movie when browsing on the iPad/iPod or in iTunes, which is vital if young kids will be launching the movies!  I use PNG images, which work fine.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7667758208244272763?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7667758208244272763/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/encoding-videos-for-apple-ipadipod.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7667758208244272763'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7667758208244272763'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/encoding-videos-for-apple-ipadipod.html' title='Encoding videos for Apple iPad/iPod, using mencoder'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-6514883912557203153</id><published>2010-07-09T16:16:00.003-04:00</published><updated>2010-07-09T16:36:57.409-04:00</updated><title type='text'>Old Orchard Beach</title><content type='html'>We spent the past half week in Old Orchard Beach, Maine (US), in the &lt;a href="http://www.tripadvisor.com/Hotel_Review-g40792-d125883-Reviews-Crest_Motel-Old_Orchard_Beach_Maine.html"&gt;Crest Motel&lt;/a&gt;.  It's a great place!  The rooms are clean, the beach is right outside, and it's a short walk to &lt;a href="http://www.palaceplayland.com/"&gt;Palace Playland&lt;/a&gt;.  They have an indoor pool &amp; hot tub, and I'm happy to report that they do not &lt;a href="http://chbits.blogspot.com/2009/10/discrimination-against-kids.html"&gt;discriminate against kids&lt;/a&gt;.  The whole area is very family friendly, and the kids had a blast!&lt;br /&gt;&lt;br /&gt;But what I loved most was... while we were staying there, they happened to be installing Solar hot water panels!  At first we saw random holes drilled in the floor and ceiling, which made all of us very curious.  Then the next day these were filled in with insulated pipes.  Finally, this morning, we saw the solar panels themselves, being carried up and installed on the roof.  The location is excellent -- they have a large roof, and no tall trees anywhere nearby.  It looked like a high capacity installation, maybe ~250 square feet.  They expect to fully generate all hot water in the summer, and also a sizable portion of the (air) heat in the winter.&lt;br /&gt;&lt;br /&gt;I have a small (600W nominal, off-grid) solar electric installation in my house, self-installed.  In the summer, it easily offsets the electricity used by our home theater.  
I involved the kids all throughout the installation, to teach them how important it is to find renewable sources for everything.  And whenever they or their friends watch movies, I remind them that it's all powered by sunshine electricity, which of course leads to a great curious discussion and more teachable moments.&lt;br /&gt;&lt;br /&gt;So I took this chance to (again) hammer home this vital lesson.  And, I'm very happy to see that others are adopting Solar energy!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-6514883912557203153?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/6514883912557203153/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/we-spent-past-half-week-in-old-orchard.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6514883912557203153'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6514883912557203153'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/we-spent-past-half-week-in-old-orchard.html' title='Old Orchard Beach'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5609717916341705905</id><published>2010-07-09T15:57:00.004-04:00</published><updated>2010-07-09T16:03:48.911-04:00</updated><title 
type='text'>Communism vs capitialism/democracy</title><content type='html'>While we in the "free" world like to believe our capitalistic, democratic way is better than the tight-fisted communist approach in, say, China, the situation is not really so clear cut.&lt;br /&gt;&lt;br /&gt;The Chinese government has full control to make changes to nearly everything in the country.  Yes, sometimes this power is used to awful ends, such as &lt;a href="http://en.wikipedia.org/wiki/Human_rights_in_the_People's_Republic_of_China"&gt;human rights violations&lt;/a&gt; and the &lt;a href="http://en.wikipedia.org/wiki/Internet_censorship_in_the_People's_Republic_of_China"&gt;great firewall of China&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;But then, this same unilateral power can lead to great progress, overnight.  For example, &lt;a href="http://news.bbc.co.uk/2/hi/business/10551140.stm"&gt;China plans to add a 5% tax on oil and gas consumption&lt;/a&gt; (plus other "raw materials").  The US, in contrast, has for a very long time offered &lt;a href="http://www.nytimes.com/2010/07/04/business/04bptax.html"&gt;massive tax breaks&lt;/a&gt; to the oil companies (this is why the price of our gasoline is horribly low compared to nearly every other country, encouraging us to buy massive, awfully fuel-inefficient cars and trucks).&lt;br /&gt;&lt;br /&gt;Another example: a few years back, China suddenly required that &lt;a href="http://english.people.com.cn/200612/19/eng20061219_334047.html"&gt;all cell phone chargers adopt the same standard&lt;/a&gt;, so that when people buy new phones they do not need a new charger, thus eliminating a big source of electronics waste.  
In contrast, in the US with our "new every 2", we have to throw away our chargers every time we upgrade since they rarely interoperate.&lt;br /&gt;&lt;br /&gt;A third example is &lt;a href="http://en.wikipedia.org/wiki/Fuel_economy_in_automobiles"&gt;fuel economy standards&lt;/a&gt;: China has for a long time had far more stringent requirements than the US.&lt;br /&gt;&lt;br /&gt;In contrast, accomplishing these excellent improvements in the US is nearly impossible: each small change requires a tremendous battle through our Congress, many members of which are &lt;a href="http://www.crewsmostcorrupt.org"&gt;rather blatantly corrupt&lt;/a&gt;, accepting all sorts of creative bribes (campaign contributions) from corporations, to buy their vote.  And corporate influence got even stronger thanks to the recent &lt;a href="http://en.wikipedia.org/wiki/Citizens_United_v._Federal_Election_Commission"&gt;Supreme Court landmark decision&lt;/a&gt; giving corporations much more freedom to influence elections.  Finally, members of Congress, especially these days, seem to vote almost always along party lines rather than for what's actually best for our country's future.&lt;br /&gt;&lt;br /&gt;Don't get me wrong: net/net I'm quite happy I live in the US, given all the tradeoffs.  
But, still, we can and should do better, and copying some of China's recent changes would be a great start.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5609717916341705905?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5609717916341705905/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/communism-vs-capitialismdemocracy.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5609717916341705905'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5609717916341705905'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/communism-vs-capitialismdemocracy.html' title='Communism vs capitialism/democracy'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7569642704584130653</id><published>2010-07-05T14:48:00.003-04:00</published><updated>2010-07-05T15:10:29.315-04:00</updated><title type='text'>A better grass</title><content type='html'>This &lt;a href="http://www.pearlspremium.com/"&gt;Pearl's premium grass&lt;/a&gt; looks awesome: mow only once per month; extremely drought tolerant (seldom or never water);  thrives without chemicals. 
It's the polar opposite of &lt;a href="http://chbits.blogspot.com/2009/08/life-support-grass.html"&gt;life-support grass&lt;/a&gt;, that's the gold standard in today's yards.&lt;br /&gt;&lt;br /&gt;Maybe, once our yard is in really bad shape, I'll try this seed, mixed with clover. I want our yard to have zero environmental cost, zero &lt;a href="http://en.wikipedia.org/wiki/Carbon_footprint"&gt;carbon footprint&lt;/a&gt;.  We currently have just that: we haven't watered our lawn in 2 years; no fertilizer, pesticides; no dethatching, mechanical aeration; very rarely raked, mowed; etc., except, it's slowly dying, and weeds are moving in, because we inherited prior life-support grass.&lt;br /&gt;&lt;br /&gt;I'd also like near-zero effort on our part (we don't want to hire a lawn care service because we feel that sets a bad example for our kids), and of course to have decent curb appeal. Surely this is not too much to expect?&lt;br /&gt;&lt;br /&gt;I wish there were more companies/people developing environmentally free/friendly alternatives to the standard life-support grass.  
I'd do a bake-off with these alternatives on different spots in our challenging yard!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7569642704584130653?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7569642704584130653/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/better-grass.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7569642704584130653'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7569642704584130653'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/better-grass.html' title='A better grass'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-4880805976612618228</id><published>2010-07-05T09:37:00.002-04:00</published><updated>2010-07-05T09:42:05.297-04:00</updated><title type='text'>Ethernet lightning protection</title><content type='html'>I'm going to try using this &lt;a href="http://www.amazon.com/APC-ProtectNet-100BT-Ethernet-Protector/dp/B00006BBGX"&gt;APC ethernet surge suppressor&lt;/a&gt; to protect the important devices (desktop, laptop, filer, backups, FIOS) attached to our ethernet network.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/8623074010562846957-4880805976612618228?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/4880805976612618228/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/ethernet-lightning-protection.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4880805976612618228'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4880805976612618228'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/ethernet-lightning-protection.html' title='Ethernet lightning protection'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5484895994330610464</id><published>2010-07-05T08:59:00.003-04:00</published><updated>2010-07-05T15:07:40.384-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Lucene's RAM usage for searching</title><content type='html'>For fast searching, Lucene loads certain data structures entirely into RAM:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;  &lt;li&gt; The terms dict index requires substantial RAM per indexed term (by default, every 128th unique term), and is loaded when IndexReader is created.  
This can be a very large amount of RAM for indexes that have an unusually high number of unique terms; to reduce this, you can pass a terms index divisor when opening the reader.  For example, passing 2, which loads only every other indexed term, halves the RAM required. But, in tradeoff, seeking to a given term, which is required once for every TermQuery, will become slower as Lucene must do twice as much scanning (on average) to find the term.&lt;br /&gt;&lt;br /&gt;  &lt;li&gt; Field cache, which is used under-the-hood when you sort by a field, takes some amount of per-document RAM depending on the field type (String is by far the worst).  This is loaded the first time you sort on that field.&lt;br /&gt;&lt;br /&gt;  &lt;li&gt; Norms, which encode the a-priori document boost computed at indexing time, including length normalization and any boosting the app does, consume 1 byte per field X document used for searching. For example, if your app searches 3 different fields, such as body, title and abstract, then that requires 3 bytes of RAM, per document.  These are loaded on-demand the first time that field is searched.&lt;br /&gt;&lt;br /&gt;  &lt;li&gt; Deletions, if present, consume 1 bit per doc, created during IndexReader construction.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;Warming a reader is necessary because of the data structures that are initialized lazily (norms, FieldCache).  It's also useful to pre-populate the OS's IO cache with those pages that cover the frequent terms you're searching on.&lt;br /&gt;&lt;br /&gt;With flexible indexing, available in Lucene's trunk (4.0-dev), we've made great progress on reducing the RAM required for both the terms dict index and the String index field cache (&lt;a href="https://issues.apache.org/jira/browse/LUCENE-2380?focusedCommentId=12871342&amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12871342"&gt;some details here&lt;/a&gt;).  
We have substantially reduced the number of objects created for these RAM-resident data structures, and switched to representing all character data as UTF8, not Java's char, which halves the RAM required when the character data is simple ASCII.&lt;br /&gt;&lt;br /&gt;So, I ran a quick check against a real index, created from the first 5 million documents from the &lt;a href="http://en.wikipedia.org/wiki/Wikipedia_database"&gt;Wikipedia database export&lt;/a&gt;.  The index has a single segment with no deletions.  I initialize a searcher, and then load norms for the body field, and populate the FieldCache for sorting by the title field, using JRE 1.6, 64-bit:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;  &lt;li&gt; 3.1-dev requires 674 MB of RAM&lt;br /&gt;&lt;br /&gt;  &lt;li&gt; 4.0-dev requires 179 MB of RAM&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;That's a 73% reduction in RAM required!&lt;br /&gt;&lt;br /&gt;However, there seems to be &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2504"&gt;some performance loss when sorting by a String field&lt;/a&gt;, which we are still tracking down.&lt;br /&gt;&lt;br /&gt;Note that modern OSs will happily swap out RAM from a process, in order to increase the IO cache.  This is rather silly: Lucene loads these specific structures into RAM because we know we will need to randomly access them, a great many times.  Other structures, like the postings data, we know we will sweep sequentially once per search, so it's less important that these structures be in RAM.   When the OS swaps our RAM out in favor of IO cache, it's reversing this careful separation!&lt;br /&gt;&lt;br /&gt;This will of course cause disastrous search latency for Lucene, since many page faults may be incurred on running a given search.  On Linux, you can fix this by &lt;a href="http://kerneltrap.org/node/3000"&gt;tuning swappiness down to 0&lt;/a&gt;, which I try to do on every Linux computer I touch (most Linux distros default this to a highish number).  
Windows also has a checkbox, under My Computer -&gt; Properties -&gt; Advanced -&gt; Performance Settings -&gt; Advanced -&gt; Memory Usage, that lets you favor Programs or System Cache, that's likely doing something similar.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5484895994330610464?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5484895994330610464/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5484895994330610464'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5484895994330610464'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html' title='Lucene&apos;s RAM usage for searching'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-4366238026103733578</id><published>2010-07-03T10:39:00.000-04:00</published><updated>2010-07-03T10:40:09.422-04:00</updated><title type='text'>Java vs .NET</title><content type='html'>&lt;a href="http://www.youtube.com/watch?v=8EOQvgdyVBY"&gt;Hilarious&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/8623074010562846957-4366238026103733578?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/4366238026103733578/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/07/java-vs-net.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4366238026103733578'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4366238026103733578'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/07/java-vs-net.html' title='Java vs .NET'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5738874179245572873</id><published>2010-06-30T05:54:00.003-04:00</published><updated>2010-07-05T05:46:57.041-04:00</updated><title type='text'>Our house was hit by lightning</title><content type='html'>Believe it or not, our house was struck by lightning!  It happened a few days ago, as a cold front swept over, bringing with it some intense but short-lived thunderstorms.&lt;br /&gt;&lt;br /&gt;I was working on my computer when I heard the loud POP of a spark behind me, in the utility closet.  At the same time my computer went blank, and there was an insanely loud thunder clap.  My poor son was on the toilet at the time and told me he was so startled that he jumped high in the air and almost fell in!  Fortunately, none of us were hurt. 
&lt;br /&gt;&lt;br /&gt;The strike destroyed our central 16-port gigabit ethernet switch, 3 out of 4 LAN ports on my &lt;a href="http://www22.verizon.com/Residential/aboutFiOS/Overview.htm?CMP=DMC-CVS_ZZ_ZZ_E_TV_N_X001"&gt;FIOS&lt;/a&gt; NAT box, a couple of power supplies and one netcam. It also fried the device I use to read the electrical (charging, inverting) data from the solar panels in my back yard, but the solar panels themselves, including the thick copper ground wires designed to "guide" lightning into the ground and away from the house, were all fine, as well as the charger, inverter and batteries.  My &lt;a href="http://en.wikipedia.org/wiki/1-Wire"&gt;1-Wire network&lt;/a&gt;, which I use to measure various indoor &amp; outdoor temperatures, is also still dead.  My wife's computer immediately shut down and rebooted, several times (spooky), but was apparently unharmed.  My computer seemed to lose both ethernet ports, but then after much rebooting and testing plug-in ethernet cards, they came back to life.&lt;br /&gt;&lt;br /&gt;A large tree branch in our neighbor's yard fell down; the neighbors across the street called the fire department; yet another neighbor saw bright sparks in his basement and also lost a bunch of electronics.&lt;br /&gt;&lt;br /&gt;Almost certainly this was not a direct strike for us; otherwise things would have been vaporized instead of simply dead.  Instead, the sudden immense electro-magnetic field created at the direct strike radiates outward, creating the local equivalent of an &lt;a href="http://en.wikipedia.org/wiki/Electromagnetic_pulse"&gt;EMP bomb&lt;/a&gt;.  This EMF then induces high voltage and current in any wires it crosses; the closer you are to the direct strike, and the longer your wires are, the more damaging the induced voltage and current are.  In my case, apparently, the extensive network of ethernet wires in my house caused most of the damage.  
This is a good reason to use WiFi!&lt;br /&gt;&lt;br /&gt;I will now buy myself &lt;a href="http://www.google.com/products?q=ethernet+lightning++protection&amp;hl=en&amp;aq=f"&gt;something&lt;/a&gt; to try to prevent this from happening again.&lt;br /&gt;&lt;br /&gt;Lightning is crazy stuff.  The &lt;a href="http://en.wikipedia.org/wiki/Lightning#Leader_formation_and_the_return_stroke"&gt;process the lightning goes through in seeking the path through which it will dump insane amounts of current&lt;/a&gt; is fascinating.  National Geographic has a &lt;a href="http://news.nationalgeographic.com/news/2004/06/0623_040623_lightningfacts.html"&gt;great facts page&lt;/a&gt;; for example, talking on your land-line telephone is the leading cause of lightning injuries inside the home. I suspect we may have been hit by &lt;a href="http://news.nationalgeographic.com/news/2004/06/0623_040623_lightningfacts_2.html"&gt;positive lightning&lt;/a&gt;, because we seemed to be hit, out of the blue, well before the storm itself seemed to arrive.&lt;br /&gt;&lt;br /&gt;Lightning strikes are apparently rather common; in just my immediate family this has now happened twice to me, and once each to my brother, father and grandparents!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5738874179245572873?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5738874179245572873/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/06/our-house-was-hit-by-lightning.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5738874179245572873'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5738874179245572873'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/06/our-house-was-hit-by-lightning.html' title='Our house was hit by lightning'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-136781016959263912</id><published>2010-06-29T05:21:00.001-04:00</published><updated>2010-07-05T15:08:00.167-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Lucene in Action 2nd Edition is done!</title><content type='html'>Lucene in Action, 2nd Edition, is finally done: the eBook is &lt;a href="http://manning.com/hatcher3"&gt;available now&lt;/a&gt;, and the print book should be released on July 8th!&lt;br /&gt;&lt;br /&gt;The source code that goes along with the book is &lt;a href="http://manning.com/hatcher3/LIAsourcecode.zip"&gt;freely available&lt;/a&gt; and free to use (&lt;a href="http://www.apache.org/licenses/LICENSE-2.0.html"&gt;Apache Software License 2.0&lt;/a&gt;), and there are two free chapters (&lt;a href="http://manning.com/hatcher3/Sample-1.pdf"&gt;Chapter 1&lt;/a&gt;, and &lt;a href="http://manning.com/hatcher3/Sample-3.pdf"&gt;Chapter 3&lt;/a&gt;).  
There is also a free green paper excerpted from the book, &lt;a href="http://manning.com/free/green_HotBackupsLucene.html"&gt;Hot Backups with Lucene&lt;/a&gt;, as well as the &lt;a href="http://www.code972.com/blog/2010/06/lucene-in-action-free-chapter-coupon-code/"&gt;section describing CLucene&lt;/a&gt;, the C/C++ port of Lucene.&lt;br /&gt;&lt;br /&gt;Writing is the best way to learn something -- because of this book I've learned all sorts of nooks and crannies in Lucene that I otherwise would not have explored for quite some time.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-136781016959263912?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/136781016959263912/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/06/lucene-in-action-2nd-edition-is-done.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/136781016959263912'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/136781016959263912'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/06/lucene-in-action-2nd-edition-is-done.html' title='Lucene in Action 2nd Edition is done!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7486802931049203615</id><published>2010-06-21T11:48:00.002-04:00</published><updated>2010-06-21T11:58:46.949-04:00</updated><title type='text'>Beware String.substring's memory usage</title><content type='html'>I'm trying to build up a test index for testing Lucene's sort performance, to track down a &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2504"&gt;regression in String sorting performance between 3.x and 4.0&lt;/a&gt;, apparently from our &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2380"&gt;packed ints cutover&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;To do this, I want to use the unique title values from &lt;a href="http://en.wikipedia.org/wiki/Wikipedia_database"&gt;Wikipedia's full database export&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;So I made a simple task in Lucene's contrib/benchmark framework to hold onto the first 1M titles it hits.  Titles tend to be small, say maybe average worst case 100 characters per document, so worst case RAM would be ~200 MB or so, right?&lt;br /&gt;&lt;br /&gt;Wrong!&lt;br /&gt;&lt;br /&gt;It turns out, in Java, when you call String's substring method, the resulting String returned to you keeps a reference to the original String, so the original String can never be GC'd if you hold onto the substring.  Java can do this "optimization" because Strings are immutable.&lt;br /&gt;&lt;br /&gt;For me, this "optimization" is a disaster: the title is obtained by getting the substring of a large string (derived from a line-doc file) that holds the full body text as well!  Instead of ~200 characters per unique title I was looking at ~25K characters!  Ugh.&lt;br /&gt;&lt;br /&gt;Fortunately, the workaround is simple -- use the String constructor that takes another String.  
This forces a private copy.&lt;br /&gt;&lt;br /&gt;I imagine for many cases this "optimization" is very worthwhile.  If you have a large original string, and pull many substrings from it, and then discard all of those substrings and the original string, you should see nice gains from this "optimization".&lt;br /&gt;&lt;br /&gt;There is a &lt;a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4513622"&gt;longstanding bug opened&lt;/a&gt; for this; likely it will never be fixed.  Really, GC should be empowered to discard the original string and keep only the substring.  Or perhaps substring should have some heuristics as to when it's dangerous to keep the reference to the original String.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7486802931049203615?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7486802931049203615/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/06/beware-stringsubstrings-memory-usage.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7486802931049203615'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7486802931049203615'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/06/beware-stringsubstrings-memory-usage.html' title='Beware String.substring&apos;s memory usage'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-1292042746238217435</id><published>2010-06-20T09:10:00.002-04:00</published><updated>2010-06-20T09:35:26.227-04:00</updated><title type='text'>Geek Dad</title><content type='html'>We've had this great family ritual, for a few years now: at the start of every summer we pitch a tent in our back yard!  We leave it there all summer, and the kids have great fun, with neighbors and friends, playing in it, hiding from the rain, etc.&lt;br /&gt;&lt;br /&gt;We also pick a few nights to sleep in the tent, which feels almost as real as camping, yet you have the delightful freedom of taking a short walk back to the house if you forgot something.&lt;br /&gt;&lt;br /&gt;So, last night we slept in the tent.  But, this year I brought our new &lt;a href="http://www.apple.com/ipad/"&gt;iPad&lt;/a&gt; with us.  The kids took turns playing a few games; one of their favorites is &lt;a href="http://itunes.apple.com/us/app/gomi-hd/id370888255?mt=8"&gt;Gomi HD&lt;/a&gt;, a game I also love for its not-so-subtle message that we humans are screwing up the planet and it's up to the kids to fix it.  Then we watched &lt;a href="http://en.wikipedia.org/wiki/Shrek_2"&gt;Shrek 2&lt;/a&gt;, which I had previously encoded for the iPad (using &lt;a href="http://handbrake.fr/"&gt;Handbrake&lt;/a&gt;).  Finally, the kids, one by one, fell asleep.&lt;br /&gt;&lt;br /&gt;Then, in the middle of the night, I woke up and heard rain hitting the tent.  
I was worried that it could turn into a big storm (in past years our tent has been literally flattened by passing thunderstorms, once when we were inside!), so, I turned on the iPad and loaded the local &lt;a href="http://www.wunderground.com/radar/radblast.asp?zoommode=pan&amp;prevzoom=zoom&amp;num=6&amp;frame=0&amp;delay=15&amp;scale=1.000&amp;noclutter=0&amp;ID=BOX&amp;type=N0R&amp;showstorms=0&amp;lat=42.44557953&amp;lon=-71.23622131&amp;label=Lexington,%20MA&amp;map.x=400&amp;map.y=240&amp;scale=1.000&amp;centerx=400&amp;centery=240&amp;showlabels=1&amp;rainsnow=0&amp;lightning=0&amp;lerror=20&amp;num_stns_min=2&amp;num_stns_max=9999&amp;avg_off=9999&amp;smooth=0"&gt;NexRAD radar&lt;/a&gt;, and confirmed that in fact it was just a small passing cell: phew!&lt;br /&gt;&lt;br /&gt;Finally, I woke up, and was ready to get up, so I used the iPad again, this time to start the coffee maker.  I have a simple Web server, written in Python, that exposes an HTML interface to various lights and appliances controlled via &lt;a href="http://www.insteon.net/"&gt;Insteon&lt;/a&gt;, including the coffee maker.  
It's vital that I minimize the time from when I first stand up in the morning to when I consume my coffee!&lt;br /&gt;&lt;br /&gt;Yes, I'm a geek Dad.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-1292042746238217435?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/1292042746238217435/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/06/geek-dad.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/1292042746238217435'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/1292042746238217435'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/06/geek-dad.html' title='Geek Dad'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-2119311654280007305</id><published>2010-06-14T18:15:00.004-04:00</published><updated>2010-06-14T18:34:17.082-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Lucene and fadvise/madvise</title><content type='html'>While indexing, Lucene periodically merges multiple segments in the index into a single larger segment.  
This keeps the number of segments relatively contained (important for search performance), and also reclaims disk space for any deleted docs on those segments.&lt;br /&gt;&lt;br /&gt;However, it has a well known problem: the merging process evicts pages from the OS's buffer cache.  The eviction is ~2X the size of the merge, or ~3X if you are using compound file.&lt;br /&gt;&lt;br /&gt;If the machine is dedicated to indexing, this usually isn't a problem; but on a machine that's also searching, this can be catastrophic as it "unwarms" your warmed reader.  Users will suddenly experience long delays when searching.  And because a large merge can take hours, this can mean hours of suddenly poor search performance.&lt;br /&gt;&lt;br /&gt;So why hasn't this known issue been fixed yet?  Because Java, unfortunately, does not expose access to the low-level APIs (&lt;a href="http://www.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html"&gt;posix_fadvise&lt;/a&gt;, &lt;a href="http://www.opengroup.org/onlinepubs/000095399/functions/posix_madvise.html"&gt;posix_madvise&lt;/a&gt;) that would let us fix this.  It's not even clear whether &lt;a href="http://java.sun.com/developer/technicalArticles/javase/nio/"&gt;NIO.2&lt;/a&gt; (in Java 7) &lt;a href="http://markmail.org/message/jvwv3ne2se3gjeee"&gt;will expose these&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;On the Lucene dev list we've long assumed that these OS-level functions should fix the issue, if only we could access them.&lt;br /&gt;&lt;br /&gt;So I decided to make a quick and dirty test to confirm this, using a small JNI extension.&lt;br /&gt;&lt;br /&gt;I created a big-ish (~7.7G) multi-segment Wikipedia index, and then ran a set of ~2900 queries against this index, over and over, letting it warm up the buffer cache.  Looking at /proc/meminfo (on Linux) I can see that the queries require ~1.4GB of hot RAM in the buffer cache (this is a CentOS Linux box with 3G RAM; the index is on a "normal" SATA hard drive).  
Finally, in a separate JRE, I opened an IndexWriter and called optimize on the index.&lt;br /&gt;&lt;br /&gt;I ran this on trunk (4.0-dev), first, and confirmed that after a short while, the search performance indeed plummets (by a factor of ~35), as expected.  RAM is much faster than hard drives!&lt;br /&gt;&lt;br /&gt;Next, I modified Lucene to call posix_fadvise with the &lt;tt&gt;NOREUSE&lt;/tt&gt; flag; from the man page, this flag looks perfect:&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Specifies that the application expects to access the specified data once and then not reuse it thereafter.&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;I re-ran the test and.... nothing changed!  Exactly the same slowdown.  So I did some digging, and found Linux's &lt;a href="http://lxr.free-electrons.com/source/mm/fadvise.c"&gt;source code for posix_fadvise&lt;/a&gt;.  If you look closely you'll see that the NOREUSE is a no-op!  Ugh.&lt;br /&gt;&lt;br /&gt;This is really quite awful.  Besides Lucene, I can imagine a number of other apps that really should use this flag.  For example, when mencoder slowly reads a 50 GB bluray movie, and writes a 5 GB H.264 file, you don't want any of those bytes to pollute your buffer cache. Same thing for rsync, backup programs, software up-to-date checkers, desktop search tools, etc.  Of all the flags, this one seems like the most important to get right!  It's possible other OSs do the right thing; I haven't tested.&lt;br /&gt;&lt;br /&gt;So what to do?&lt;br /&gt;&lt;br /&gt;One approach is to forcefully free the pages, using the DONTNEED flag. This will drop the specified pages from the buffer cache.  But there's a serious problem: the search process is using certain pages in these files!  So you must only drop those pages that the merge process, alone, had pulled in.  
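&lt;br /&gt;&lt;br /&gt;As a rough illustration of the DONTNEED approach (a minimal sketch in Python, not the JNI extension used for these tests; the helper name drop_from_page_cache is hypothetical), the following advises the kernel to drop every cached page of one file.  It is deliberately the blunt whole-file form, which is exactly the form that would also evict pages a searcher depends on:&lt;br /&gt;&lt;br /&gt;

```python
import os


def drop_from_page_cache(path: str) -> bool:
    """Advise the kernel to drop cached pages of `path`.

    Returns True if the advice was issued, False if this platform
    does not expose posix_fadvise through the os module.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 with length=0 covers the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True
```

The call is only advice: the file's contents are untouched, and pages that have not yet been written back are generally left in place, so a writer would likely want to fsync before asking for the drop.&lt;br /&gt;&lt;br /&gt;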
You can use the &lt;a href="http://www.kernel.org/doc/man-pages/online/pages/man2/mincore.2.html"&gt;mincore&lt;/a&gt; function to query for those pages that are already cached, so you know which ones not to drop.  A &lt;a href="http://insights.oetiker.ch/linux/fadvise"&gt;neat patch for rsync&lt;/a&gt; took exactly this approach.  The problem is that mincore provides only a snapshot, so you'd have to call it many times while merging to try to minimize discarding pages that had been recently cached for searching.&lt;br /&gt;&lt;br /&gt;We should not have to resort to such silly hacks!&lt;br /&gt;&lt;br /&gt;Another approach is to switch to memory-mapped IO, using Lucene's MMapDirectory, and then use madvise.  The SEQUENTIAL option looks promising from the man page:&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Expect page references in sequential order.  (Hence, pages in the given range can be aggressively read ahead, and may be freed soon after they are accessed.)&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;Looking through the Linux sources, it looks like the SEQUENTIAL option is at least not a no-op; that setting has some influence over how pages are evicted.&lt;br /&gt;&lt;br /&gt;So I tested that, but, alas, the search performance still plummeted. No go!&lt;br /&gt;&lt;br /&gt;Yet another approach is to bypass all OS caching entirely, only when merging, by using the Linux-specific O_DIRECT flag.  Merge performance will definitely take a hit, since the OS is no longer doing readahead or write caching, and every single IO request must hit the disk while you wait, but for many apps this could be a good tradeoff.&lt;br /&gt;&lt;br /&gt;So I created a prototype Directory implementation, a variant of DirectNIOFSDirectory (currently a patch on &lt;a href="https://issues.apache.org/jira/browse/LUCENE-2056"&gt;LUCENE-2056&lt;/a&gt;), that opened all files (input and output) with O_DIRECT (using JNI). 
It's a little messy because all IO must be "aligned" by certain rules (I followed the rules for 2.6.* kernels).&lt;br /&gt;&lt;br /&gt;Finally, this solution worked great!  Search performance was unchanged all through the optimize call, including building the CFS at the end. Third time's a charm!&lt;br /&gt;&lt;br /&gt;However, the optimize call slowed down from 1336 to 1680 seconds (26% slower).  This could likely be reduced by further increasing the buffer sizes (I used a 1 MB buffer for each IndexInput and IndexOutput, which is already large), or possibly by creating our own readahead / write cache scheme.&lt;br /&gt;&lt;br /&gt;We &lt;em&gt;really&lt;/em&gt; should not have to resort to such silly hacks!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-2119311654280007305?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/2119311654280007305/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html#comment-form' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2119311654280007305'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2119311654280007305'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html' title='Lucene and fadvise/madvise'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-2834791922507032481</id><published>2010-06-06T11:58:00.010-04:00</published><updated>2010-06-06T13:54:30.979-04:00</updated><title type='text'>Finding the lead in your house</title><content type='html'>It's known that &lt;a href="http://www.cnn.com/2010/HEALTH/05/13/lead.poisoning.landrigan/index.html"&gt;exposure to lead&lt;/a&gt;, even in tiny amounts, causes loss of intelligence and other nasty problems in children.&lt;br /&gt;&lt;br /&gt;We have now phased out lead paint, leaded gasoline, and &lt;a href="http://en.wikipedia.org/wiki/Restriction_of_Hazardous_Substances_Directive"&gt;lead (and other dangerous elements) in electronics&lt;/a&gt;.  But, surprisingly, it was only relatively recently (2008), with the passage of the &lt;a href="http://en.wikipedia.org/wiki/Consumer_Product_Safety_Improvement_Act"&gt;Consumer Product Safety Improvement Act&lt;/a&gt;, that we finally set limits on the amount of lead in consumer items, particularly kids' toys.  As of June 2010, the legal limit for lead in kids' toys is 300 ppm (parts per million) and will drop to 100 ppm by August 2011.  The legal limit for cadmium in paint on kids' toys is 75 ppm.&lt;br /&gt;&lt;br /&gt;Our house is filled with all sorts of toys, many of which we've picked up as hand-me-downs from various sources.  I've long wondered whether these toys have lead, cadmium, etc...&lt;br /&gt;&lt;br /&gt;So, I decided to test them myself, using an &lt;a href="http://www.niton.com/Niton-Analyzers-Products/xl3/xl3t.aspx?sflang=en"&gt;XRF analyzer&lt;/a&gt;.  This amazing handheld device emits a small directed X-ray beam out the front, causing elements to &lt;a href="http://en.wikipedia.org/wiki/X-ray_fluorescence"&gt;fluoresce&lt;/a&gt; with specific spectral signatures.  
The device detects the strength of these signatures and reports back to you the breakdown of elements in the sample, either in parts per million (ppm) or in percentage (1% = 10,000 ppm!).&lt;br /&gt;&lt;br /&gt;In addition to &lt;a href="http://en.wikipedia.org/wiki/Lead"&gt;lead&lt;/a&gt;, the device reliably detects other elements like &lt;a href="http://en.wikipedia.org/wiki/Cadmium"&gt;cadmium&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Bromine"&gt;bromine&lt;/a&gt; (used in &lt;a href="http://en.wikipedia.org/wiki/Brominated_flame_retardant"&gt;brominated flame retardants&lt;/a&gt;), &lt;a href="http://en.wikipedia.org/wiki/Chlorine"&gt;chlorine&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Arsenic"&gt;arsenic&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Mercury_(element)"&gt;mercury&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Tin"&gt;tin&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Antimony"&gt;antimony&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Chromium"&gt;chromium&lt;/a&gt;, and many others!&lt;br /&gt;&lt;br /&gt;There are some limitations.  For example, all of the door knobs in my house tested high (1-2%) for lead; it turns out they are nickel plated, and the XRF analyzer was seeing through this layer to the lead beneath.  Likely this lead would never transfer to our hands, unless the nickel wears through.&lt;br /&gt;&lt;br /&gt;Another limitation is that the device detects elements, not their molecular form.  For example, certain forms of &lt;a href="http://en.wikipedia.org/wiki/Chromium"&gt;chromium&lt;/a&gt;, like hexavalent chromium, are toxic, while other forms, like trivalent chromium, are useful and necessary in the human body.&lt;br /&gt;&lt;br /&gt;Finally, just because a material contains a given element doesn't mean that element would ever find its way into a child's body.  
For example, lead bound in plastic is likely difficult to dislodge.&lt;br /&gt;&lt;br /&gt;For all these reasons, just because the analyzer detects certain elements in a given item does not mean the item could cause harm.  Think of the analyzer as a simple fact-finder: it reports the raw elements it sees.  What action you then choose to take is your decision.  However, the &lt;a href="http://en.wikipedia.org/wiki/Precautionary_principle"&gt;precautionary principle&lt;/a&gt; applies here: if an item does have measurable amounts of lead or cadmium, why risk exposing your family to it?&lt;br /&gt;&lt;br /&gt;While the devices are still insanely expensive, they have come down in price enough to make rental just barely within reach for the end consumer.  I rented mine (a &lt;a href="http://www.niton.com/Niton-Analyzers-Products/xl3/xl3t.aspx?sflang=en"&gt;Niton XL3T&lt;/a&gt;) for a few hundred dollars a day from &lt;a href="http://www.ashtead-technology.com/"&gt;Ashtead Technology&lt;/a&gt;, and split the cost with a neighbor (she tested by night and I tested by day!).&lt;br /&gt;&lt;br /&gt;So I walked around my house, testing everything, and the results were a real eye-opener!  
A number of ordinary looking things had lead, sometimes at levels far higher than the 300 ppm legal limit, plus other possibly harmful elements:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_4pUbN9gxhUI/TAvXFTVMZQI/AAAAAAAAAAc/4Aweqn3JOMQ/s1600/lead1.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 225px;" src="http://3.bp.blogspot.com/_4pUbN9gxhUI/TAvXFTVMZQI/AAAAAAAAAAc/4Aweqn3JOMQ/s400/lead1.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5479709857714824450" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The green rain coat has 6,537 ppm lead; the handle on the lacrosse stick has 14,700 ppm lead; the basketball has 3,320 ppm lead and 322 ppm arsenic; the lunchbox has 677 ppm lead; the doctor's bag has 2,511 ppm antimony and 55 ppm arsenic.  Many smaller toys also have lead:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4pUbN9gxhUI/TAvXrmRowzI/AAAAAAAAAAk/Ch72RG8XcJU/s1600/lead3.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 397px; height: 317px;" src="http://4.bp.blogspot.com/_4pUbN9gxhUI/TAvXrmRowzI/AAAAAAAAAAk/Ch72RG8XcJU/s400/lead3.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5479710515635209010" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The red measuring spoon has 1,651 ppm lead; the blue dinosaur has 767 ppm lead and 91 ppm cadmium; the red car has 860 ppm lead; the red plate has 3,268 ppm lead; the blue train has 271 ppm lead; the little green ABC has 1,015 ppm lead.  The three legos were surprising, having between 1,245 and 2,427 ppm lead -- they are not actually legos; they are a copycat brand (Mega Bloks).  
All of our "real" legos tested fine.&lt;br /&gt;&lt;br /&gt;Other toys have high cadmium:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4pUbN9gxhUI/TAvZR7hozJI/AAAAAAAAAAs/MdQWFuCXLuo/s1600/lead2.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 342px; height: 205px;" src="http://1.bp.blogspot.com/_4pUbN9gxhUI/TAvZR7hozJI/AAAAAAAAAAs/MdQWFuCXLuo/s400/lead2.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5479712273686121618" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The little slinky has 527 ppm cadmium and 1,365 ppm lead; the salt shaker has 22,634 ppm cadmium; the ice cream scoop has 1,188 ppm cadmium.  Here are a bunch of other small toys that had varying amounts of lead:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4pUbN9gxhUI/TAvZosEkGII/AAAAAAAAAA0/x0NNgXCE9CU/s1600/lead4.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 306px;" src="http://4.bp.blogspot.com/_4pUbN9gxhUI/TAvZosEkGII/AAAAAAAAAA0/x0NNgXCE9CU/s400/lead4.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5479712664674637954" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Small toys are especially spooky since babies are more likely to put them in their mouths.  These pictures show only about 1/3rd of the things I found with lead!  Here are some other surprising discoveries:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt; Old Christmas ornaments often have very high lead.  I had one silver ball that had ~1% arsenic, ~20% lead. Worse, it was well worn, which means the arsenic and lead had rubbed off onto people's hands, over the years.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; Car keys have lead (~6,700 ppm for my 2 keys)!  
Don't let your babies play with them.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; Nearly every wire has high lead content in the insulation; Christmas lights had especially high lead.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; The soft (usually black) plastic/rubber handles on kids' bikes and tricycles are often very high in lead.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; An old recliner had 29,500 ppm lead on the foot rest and 20,000 ppm lead on the arm rests.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; Our door knobs, cabinet knobs, faucets, and spouts had ~1-2% lead.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; We had a jar of kids' vitamins.  The vitamins tested OK, but the jar (it was a dark brown tint) had 140 ppm lead.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; One of our garden hoses had 5,255 ppm lead; not great because we had used this hose to water the plants in our edible garden.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; Our waffle iron had 1,143 ppm lead on the cooking surface (likely, though, this was the metal under the teflon).&lt;br /&gt;&lt;br /&gt;&lt;li&gt; My supposedly lead-free solder (was not cheap!!) in fact contains 0.5% lead.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Here are some "rough" patterns:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt; The older the toy, the more likely it is to have high lead, cadmium, etc.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; Newer things that are "burnable" (fleece, bedding, futons, mattresses, beds, plush chairs, etc.) often have very high (10s of thousands ppm) levels of bromine.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; Colors like yellow, orange, red often have lead in their pigment.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; Beware glasses that have paint on them!  We had one glass that was 8% lead -- 266 times the legal limit.  
Colored glass (where the color is embedded into the glass) is also more likely to be leaded.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;I'm not the only consumer playing with XRF analyzers: a recent recall of 12M Shrek glasses by McDonald's was initiated by at least two people who &lt;a href="http://www.npr.org/templates/story/story.php?storyId=127474049&amp;ps=cprs"&gt;discovered high levels of cadmium in the paint on the glass using XRF analyzers&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-2834791922507032481?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/2834791922507032481/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/06/finding-lead-in-your-house.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2834791922507032481'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2834791922507032481'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/06/finding-lead-in-your-house.html' title='Finding the lead in your house'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_4pUbN9gxhUI/TAvXFTVMZQI/AAAAAAAAAAc/4Aweqn3JOMQ/s72-c/lead1.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5992778959458022640</id><published>2010-06-05T14:53:00.007-04:00</published><updated>2010-10-04T13:59:04.628-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Lucene'/><title type='text'>Lucene's PulsingCodec on "Primary Key" Fields</title><content type='html'>Flexible indexing in &lt;a href="http://lucene.apache.org"&gt;Lucene&lt;/a&gt; (now available on trunk, which will eventually be the next major release, 4.0) enables apps to use custom codecs to write/read the postings (fields, terms, docs, positions, payloads).&lt;br /&gt;&lt;br /&gt;By default, Lucene uses the &lt;em&gt;StandardCodec&lt;/em&gt;, which writes and reads in nearly the same format as the current stable branch (3.x). Details for a given term are stored in terms dictionary files, while the docs and positions where that term occurs are stored in separate files.&lt;br /&gt;&lt;br /&gt;But there is an experimental codec, &lt;em&gt;PulsingCodec&lt;/em&gt;, which implements the &lt;em&gt;pulsing&lt;/em&gt; optimization described in a &lt;a href="http://www.jopedersen.com/Publications/cutting90optimizations.pdf"&gt;paper by Doug Cutting and Jan Pedersen&lt;/a&gt;.  The idea is to inline the docs/positions/payloads data into the terms dictionary for low frequency terms, so that you save 1 disk seek when retrieving document(s) for that term.&lt;br /&gt;&lt;br /&gt;The &lt;tt&gt;PulsingCodec&lt;/tt&gt; wraps another fallback &lt;tt&gt;Codec&lt;/tt&gt; that you provide; this allows the pulsing to be dynamic, per term.  For each term, if its frequency (the number of documents that it appears in) is at or below a threshold (default 1) that you provide, then that term's postings are inlined into the terms dictionary; otherwise, the term is forwarded (&lt;em&gt;pulsed&lt;/em&gt;) to the wrapped codec.  
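&lt;br /&gt;&lt;br /&gt;The per-term decision can be sketched as a toy model (plain Python, not Lucene's implementation; PulsingTermsDict, freq_cutoff and the seek counts are hypothetical names and simplifications, and the cutoff is assumed to be inclusive so that a primary key term with a single document is inlined at the default of 1):&lt;br /&gt;&lt;br /&gt;

```python
from dataclasses import dataclass, field


@dataclass
class PulsingTermsDict:
    """Toy terms dictionary: rare terms carry their postings inline."""
    freq_cutoff: int = 1
    inlined: dict = field(default_factory=dict)        # term -> doc IDs stored in the dict
    pointers: dict = field(default_factory=dict)       # term -> index into postings_file
    postings_file: list = field(default_factory=list)  # stand-in for the wrapped codec

    def add_term(self, term: str, doc_ids: list) -> None:
        if len(doc_ids) > self.freq_cutoff:
            # Frequent term: forward the postings, keep only a pointer.
            self.pointers[term] = len(self.postings_file)
            self.postings_file.append(list(doc_ids))
        else:
            # Rare term: store the postings right in the terms dictionary.
            self.inlined[term] = list(doc_ids)

    def lookup(self, term: str):
        """Return (doc_ids, seeks), counting one seek per structure touched."""
        if term in self.inlined:
            return self.inlined[term], 1
        if term in self.pointers:
            return self.postings_file[self.pointers[term]], 2
        return [], 1


terms = PulsingTermsDict()
terms.add_term("id-42", [7])       # unique key: inlined
terms.add_term("the", [1, 2, 3])   # common term: forwarded
print(terms.lookup("id-42"))       # ([7], 1)
print(terms.lookup("the"))         # ([1, 2, 3], 2)
```

In this model every term of a primary key field takes the inlined path, so each lookup touches one structure instead of two.&lt;br /&gt;&lt;br /&gt;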
This means &lt;tt&gt;PulsingCodec&lt;/tt&gt; should be helpful for ordinary text fields which obey &lt;a href="http://en.wikipedia.org/wiki/Zipf's_law"&gt;Zipf's Law&lt;/a&gt;, as many terms will be rare-ish.&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;PulsingCodec&lt;/tt&gt; should really shine on "primary key" fields, where each term occurs in exactly one document, and batch lookups (for example because the app performs deletes, updates and/or lookups) are common.&lt;br /&gt;&lt;br /&gt;I created a simple performance test to confirm this.&lt;br /&gt;&lt;br /&gt;The test first creates an optimized index with 10M docs, where each doc has a single field with a randomly generated unique term, and then performs term -&gt; doc lookup for N (parameter) random terms.  It's a self-contained test (source code is &lt;a href="http://www.codeupload.com/572"&gt;here&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;It's important to flush your OS's IO cache before running the test; otherwise you can't measure the reduced number of seeks.  On recent Linux kernels, just run &lt;em&gt;echo 1 &gt; /proc/sys/vm/drop_caches&lt;/em&gt;.  
That said, in real production usage, the IO cache will typically (legitimately) help you, and pulsing should make more efficient use of the IO cache since the postings data is contiguously stored.&lt;br /&gt;&lt;br /&gt;To measure the speedup from using &lt;tt&gt;PulsingCodec&lt;/tt&gt; on a primary key field, as well as the impact of the OS's IO cache, I ran the &lt;a href="http://www.codeupload.com/572"&gt;above test&lt;/a&gt; on an increasing number of random term lookups (always flushing the OS's IO cache first):&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4pUbN9gxhUI/TAqnyf2_49I/AAAAAAAAAAU/yXDf6by8VUY/s1600/PulsingSpeedup.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 209px;" src="http://1.bp.blogspot.com/_4pUbN9gxhUI/TAqnyf2_49I/AAAAAAAAAAU/yXDf6by8VUY/s400/PulsingSpeedup.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5479376382637106130" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The results are compelling!  When performing a small number of term lookups relative to the total number of terms on a cold OS IO cache, which is likely the more common case in a real application, pulsing shows a ~45-50% speedup, as expected, since it requires 1/2 the seeks.&lt;br /&gt;&lt;br /&gt;As the number of random term lookups increases, &lt;tt&gt;PulsingCodec&lt;/tt&gt;'s gains decrease, because more and more of the lookups are hitting the OS's IO cache and thus avoiding the seek (the machine I ran the test on had plenty of RAM to cache the entire index).  
It's interesting that &lt;tt&gt;PulsingCodec&lt;/tt&gt; still shows ~15% gain once the lookups are mostly cached; likely this is because &lt;tt&gt;PulsingCodec&lt;/tt&gt; saves the deref cost of finding the postings in the &lt;tt&gt;frq&lt;/tt&gt; file.&lt;br /&gt;&lt;br /&gt;Pulsing also makes the index a bit smaller (211 MB vs 231 MB), because it saves one vLong pointer per term.  For the test, the index with pulsing had a 0 byte &lt;tt&gt;frq&lt;/tt&gt; file since all postings were inlined into the terms dict.  There is no &lt;tt&gt;prx&lt;/tt&gt; file because I index the field with &lt;tt&gt;setOmitTermFreqAndPositions(true)&lt;/tt&gt;.&lt;br /&gt;&lt;br /&gt;Note that the &lt;a href="http://www.codeupload.com/572"&gt;test case&lt;/a&gt; simply uses PulsingCodec for all fields; if you'd like per-field control you should use the &lt;tt&gt;PerFieldCodecWrapper&lt;/tt&gt;.  However, because &lt;tt&gt;PulsingCodec&lt;/tt&gt; is dynamic (per term), it is likely a good default for all fields.&lt;br /&gt;&lt;br /&gt;Another way to speed up primary key lookups through Lucene is to store your index on a solid-state disk, where seeks are much less costly than they are on spinning magnets (though, still several orders of magnitude more costly than RAM).  
Or better yet, do both!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5992778959458022640?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5992778959458022640/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5992778959458022640'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5992778959458022640'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html' title='Lucene&apos;s PulsingCodec on &quot;Primary Key&quot; Fields'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_4pUbN9gxhUI/TAqnyf2_49I/AAAAAAAAAAU/yXDf6by8VUY/s72-c/PulsingSpeedup.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-122172139304726608</id><published>2010-05-14T08:28:00.002-04:00</published><updated>2010-05-14T10:56:21.855-04:00</updated><title type='text'>Cancer and chemicals</title><content type='html'>The &lt;a href="http://deainfo.nci.nih.gov/advisory/pcp/pcp.htm"&gt;President's Cancer panel&lt;/a&gt; yesterday released their &lt;a 
href="http://deainfo.nci.nih.gov/advisory/pcp/pcp08-09rpt/PCP_Report_08-09_508.pdf"&gt;2008-2009 annual report&lt;/a&gt;.  It's a long (240 page) report, and what's amazing is that these are mainstream scientists who are now calling for precautionary reduction of our exposure to all sorts of chemicals we are now routinely exposed to.&lt;br /&gt;&lt;br /&gt;It has some sobering facts about cancer, such as:&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Approximately 41% of Americans will be diagnosed with cancer at some point in their lives, and 21% will die from it.&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;And:&lt;br /&gt;&lt;br /&gt;&lt;em&gt;The incidence of some cancers, including some most common among children, is increasing for unexplained reasons.&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.nytimes.com/2010/05/06/opinion/06kristof.html?src=me&amp;ref=homepage"&gt;This op-ed&lt;/a&gt; pulls out some great highlights from the report.  For example, here's a depressing one:&lt;br /&gt;&lt;br /&gt;  &lt;em&gt;Noting that 300 contaminants have been detected in umbilical cord blood of newborn babies, the study warns that: "to a disturbing extent, babies are born 'pre-polluted.' "&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;And then there's this quote:&lt;br /&gt;&lt;br /&gt;  &lt;em&gt;The report blames weak laws, lax enforcement and fragmented authority, as well as the existing regulatory presumption that chemicals are safe unless strong evidence emerges to the contrary.&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;Congress is now attempting to address this, with the &lt;a href="http://chbits.blogspot.com/2010/04/safe-chemicals-act.html"&gt;Safe Chemicals Act&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Here is a quote specifically about &lt;a href="http://chbits.blogspot.com/2009/08/on-bottled-water.html"&gt;Bisphenol A&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;  &lt;em&gt;Studies of BPA have raised alarm bells for decades, and the evidence is still complex and open to debate. 
That's life: In the real world, regulatory decisions usually must be made with ambiguous and conflicting data. The panel's point is that we should be prudent in such situations, rather than recklessly approving chemicals of uncertain effect.&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;This is an important point: during the time of uncertainty, when we don't know that a given chemical is dangerous nor do we know that it is safe, we should err on the side of caution, treating the chemical as guilty until proven innocent.  I discovered that there is a name for this approach: the &lt;a href="http://en.wikipedia.org/wiki/Precautionary_principle"&gt;precautionary principle&lt;/a&gt;, and this is of course the core change in the Safe Chemicals Act.&lt;br /&gt;&lt;br /&gt;Here's yet another quote, this time from a &lt;a href="http://www.cnn.com/2010/HEALTH/05/13/lead.poisoning.landrigan/index.html"&gt;brief article&lt;/a&gt; about one of the scientists who discovered that lead exposure, even in tiny amounts, is very dangerous for kids:&lt;br /&gt;&lt;br /&gt;&lt;em&gt;"We've been very careless in simply presuming that chemicals are innocent until proven guilty," says Dr. 
Phillip Landrigan.&lt;/em&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-122172139304726608?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/122172139304726608/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/05/cancer-and-chemicals.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/122172139304726608'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/122172139304726608'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/05/cancer-and-chemicals.html' title='Cancer and chemicals'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5680366854707046133</id><published>2010-05-13T11:14:00.002-04:00</published><updated>2010-05-13T11:35:02.293-04:00</updated><title type='text'>Plants and animals use quantum mechanics?</title><content type='html'>&lt;a href="http://en.wikipedia.org/wiki/Quantum_mechanics"&gt;Quantum mechanics&lt;/a&gt;, which Einstein once referred to as "spooky action at a distance", is a set of laws that govern how tiny (atomic &amp; sub-atomic) things interact with one another.  
The laws are very different from classical physics, and really quite surprising, but nevertheless appear to be true (there's been much experimental validation).&lt;br /&gt;&lt;br /&gt;Using quantum mechanics, it's possible to build a quantum computer, and indeed many research labs and at least one &lt;a href="http://en.wikipedia.org/wiki/D-Wave_Systems"&gt;startup&lt;/a&gt; have built simple ones.  Quantum computers can do some amazing things, such as &lt;a href="http://en.wikipedia.org/wiki/Shor's_algorithm"&gt;factoring integers very quickly&lt;/a&gt;, something classical computers can only do very slowly as the number gets bigger (as best we know, so far).&lt;br /&gt;&lt;br /&gt;Most recently, a simplistic quantum computer was used to &lt;a href="http://www.popsci.com/science/article/2010-05/single-molecule-computes-thousands-times-faster-your-pc"&gt;compute a discrete Fourier transform using a single iodine molecule&lt;/a&gt;.  Someday quantum computers will be all over the place...&lt;br /&gt;&lt;br /&gt;But, isn't it possible that plants and animals have evolved to take advantage of quantum mechanics?  Indeed there is evidence that we have! &lt;br /&gt;&lt;br /&gt;The process of photosynthesis looks to be &lt;a href="http://www.nanowerk.com/news/newsid=16218.php"&gt;based on quantum entanglement&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;And, one leading theory about how animals can smell so well is based on &lt;a href="http://www.scientificamerican.com/article.cfm?id=is-sense-of-smell-powered"&gt;quantum vibrations&lt;/a&gt;.  Luca Turin, who created this theory, has a good quote in that article:&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Most people would probably feel that if it can be done at all, evolution has managed to make use of it.&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;And this makes sense - evolution is relentless at trying to find good ways to create plants and animals.  
Since quantum mechanics is real, evolution should have tapped into it.&lt;br /&gt;&lt;br /&gt;This is also the reason I would expect &lt;a href="http://en.wikipedia.org/wiki/Lamarckism"&gt;Lamarckian inheritance&lt;/a&gt; to in fact be true.  Any animal that can alter the traits of its offspring based on experiences in its own lifetime would clearly have a big advantage, so, evolution really should have found a way.&lt;br /&gt;&lt;br /&gt;In fact, it sort of did, in a non-biological manner: language.  We can pass on all sorts of life lessons to our kids, through language, and we get to stand on the shoulders of past giants, as we take for granted the knowledge created by the generations before us, passed on through language.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5680366854707046133?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5680366854707046133/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/05/plants-and-animals-use-quantum.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5680366854707046133'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5680366854707046133'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/05/plants-and-animals-use-quantum.html' title='Plants and animals use quantum mechanics?'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7523256914792019434</id><published>2010-05-03T06:50:00.002-04:00</published><updated>2010-05-03T06:53:07.608-04:00</updated><title type='text'>Lots of problems these days...</title><content type='html'>I try to keep my kids roughly informed about what's going on in the world; I think it's important they grow up with at least a basic world view.&lt;br /&gt;&lt;br /&gt;Last night, my 7-year-old son observed "you know there are a lot of problems right now", and he's right!  He then rattled off the &lt;a href="http://www.youtube.com/watch?v=f1ztg0wUqKY"&gt;Iceland volcano eruption&lt;/a&gt;, the &lt;a href="http://www.boston.com/news/local/massachusetts/articles/2010/05/02/a_catastrophic_rupture_hits_regions_water_system"&gt;ruptured water main in our state&lt;/a&gt;, forcing us to boil water before drinking it, and the disastrous &lt;a href="http://en.wikipedia.org/wiki/Deepwater_Horizon_drilling_rig_explosion"&gt;oil rig explosion and subsequent, ongoing oil geyser&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7523256914792019434?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7523256914792019434/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/05/lots-of-problems-these-days.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7523256914792019434'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7523256914792019434'/><link 
rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/05/lots-of-problems-these-days.html' title='Lots of problems these days...'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-3084027320352269934</id><published>2010-05-03T06:35:00.003-04:00</published><updated>2010-05-03T06:44:02.981-04:00</updated><title type='text'>Concord bans sale of bottled water</title><content type='html'>Fabulous!  The town of Concord, MA has &lt;a href="http://www.boston.com/lifestyle/green/articles/2010/05/01/concord_fires_first_shot_in_water_battle/"&gt;banned the sale of bottled water&lt;/a&gt; because of the wretched environmental impact these bottles have.&lt;br /&gt;&lt;br /&gt;The environmental cost of such trash is stunning.  We know this fills up our landfills, but have you heard of  the &lt;a href="http://en.wikipedia.org/wiki/Great_Pacific_Garbage_Patch"&gt;Great Pacific Garbage Patch&lt;/a&gt;?  
This is one of 5 spots in the oceans where garbage collects and kills marine life.&lt;br /&gt;&lt;br /&gt;Of course you really shouldn't &lt;a href="http://chbits.blogspot.com/2009/08/on-bottled-water.html"&gt;drink bottled water in the first place&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-3084027320352269934?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/3084027320352269934/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/05/concord-bans-sale-of-bottled-water.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3084027320352269934'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3084027320352269934'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/05/concord-bans-sale-of-bottled-water.html' title='Concord bans sale of bottled water'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-4235350627272207993</id><published>2010-05-02T09:30:00.003-04:00</published><updated>2010-05-02T09:32:53.197-04:00</updated><title type='text'>Your ideas and your name</title><content type='html'>I love this quote:&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Your ideas will go further if you don't insist on going with them.&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;It's 
very true (note that I didn't tell you who said it)!&lt;br /&gt;&lt;br /&gt;I find it especially applies to healthy open source projects.  In Apache, the individuals (contributors, committers) who work on a given project are fleeting, transient.  We will come and go.  Our names are not attached to the code we commit.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-4235350627272207993?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/4235350627272207993/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/05/your-ideas-and-your-name.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4235350627272207993'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4235350627272207993'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/05/your-ideas-and-your-name.html' title='Your ideas and your name'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-440859992998942319</id><published>2010-05-02T09:09:00.002-04:00</published><updated>2010-05-02T09:25:03.155-04:00</updated><title type='text'>Sinister search engine de-optimization</title><content type='html'>There's a &lt;a href="http://www.washingtonpost.com/wp-dyn/content/article/2010/05/01/AR2010050103051.html"&gt;large 
recall going on right now&lt;/a&gt; for many popular over-the-counter infants' and children's medicines, such as &lt;a href="http://www.mcneilproductrecall.com/page.jhtml?id=/include/new_recall.inc"&gt;Motrin&lt;/a&gt;, &lt;a href="http://www.mcneilproductrecall.com/page.jhtml?id=/include/new_recall.inc"&gt;Tylenol&lt;/a&gt;, and &lt;a href="http://www.mcneilproductrecall.com/page.jhtml?id=/include/new_recall.inc"&gt;Zyrtec&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;But if you look at the &lt;a href="http://www.mcneilproductrecall.com/page.jhtml?id=/include/new_recall.inc"&gt;recall details page&lt;/a&gt;, posted by the company that manufactures these medicines (McNeil), you'll see that the table is actually a single JPEG image instead of an HTML table.  If you don't believe me, try searching in your browser for the words you see in that table!&lt;br /&gt;&lt;br /&gt;At first I thought "how strange -- why would they use an image instead of a normal HTML table?".  But then a more sinister plot came to mind: perhaps they want to make it as hard as possible for future web searches to find this page.  After all, they must now be in major damage control mode.  It's the exact opposite problem of the more common &lt;a href="http://en.wikipedia.org/wiki/Search_engine_optimization"&gt;search engine optimization&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Hiding text in a JPEG image to avoid searches finding you is a rather nasty practice, in my opinion (hmm, I see there's even &lt;a href="http://www.hidetext.net"&gt;this service&lt;/a&gt; to help you do it!).  Google could prevent such sneakiness by running &lt;a href="http://en.wikipedia.org/wiki/Optical_character_recognition"&gt;OCR&lt;/a&gt; on the image (perhaps they do this already -- anyone know?), but then I suppose the war would escalate and we'd start seeing barely readable tables like this that look like the dreaded &lt;a href="http://en.wikipedia.org/wiki/CAPTCHA"&gt;Captcha tests&lt;/a&gt;.  
To work around such companies, if we all link to this page with some real text (as I've done above) then Google will still find it!&lt;br /&gt;&lt;br /&gt;It's also possible there's a more reasonable explanation for this; maybe the good old saying "&lt;em&gt;never attribute to malice that which can be adequately explained by something else&lt;/em&gt;" somehow applies?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-440859992998942319?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/440859992998942319/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/05/sinister-search-engine-de-optimization.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/440859992998942319'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/440859992998942319'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/05/sinister-search-engine-de-optimization.html' title='Sinister search engine de-optimization'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7103959724786447734</id><published>2010-04-18T19:34:00.002-04:00</published><updated>2010-04-18T19:40:32.542-04:00</updated><title type='text'>Safe Chemicals Act</title><content type='html'>Finally, there's been a bill introduced to 
Congress, called the Safe Chemicals Act, to better regulate the chemicals we all are exposed to in day-to-day products.&lt;br /&gt;&lt;br /&gt;The current law (from 1976) is ancient, and assumes any chemical is safe until proven otherwise -- innocent until proven guilty -- a dangerously lax approach which has resulted in the potentially dangerous chemicals we all have heard about, such as bisphenol A in certain plastics, brominated flame retardants, numerous phthalates released by vinyl (this causes that awful new shower curtain smell), etc.&lt;br /&gt;&lt;br /&gt;There are 80,000 chemicals in use today and the EPA has only required testing for 200 of them.  There are surely additional dangerous chemicals lurking, undetected.&lt;br /&gt;&lt;br /&gt;Refreshingly, the new bill takes the opposite approach, the same approach I take with my family: a chemical is guilty until proven innocent!  This is why I use no pesticides on my &lt;a href="http://chbits.blogspot.com/2009/08/life-support-grass.html"&gt;life support grass&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;If this bill becomes law, manufacturers must prove a chemical is safe before they can use it in their products.  The burden of proof moves to the manufacturers. &lt;a href="http://www.time.com/time/health/article/0,8599,1982489,00.html"&gt;Here&lt;/a&gt;'s a good writeup of the bill.&lt;br /&gt;&lt;br /&gt;Lobbyists will undoubtedly fight this bill tooth and nail... so we need to push our congressional representatives!!  
You can use &lt;a href="http://bit.ly/bluzJt"&gt;this site&lt;/a&gt; to do so.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7103959724786447734?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7103959724786447734/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/04/safe-chemicals-act.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7103959724786447734'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7103959724786447734'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/04/safe-chemicals-act.html' title='Safe Chemicals Act'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-1679741834093539673</id><published>2010-03-08T06:22:00.002-05:00</published><updated>2010-03-08T06:31:51.305-05:00</updated><title type='text'>Disagreements are healthy!</title><content type='html'>Another great quote, this time from Henry Ford:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;If two people always agree, one of them is unnecessary.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The quote applies very well to open source development -- what makes open source so strong is that such wildly diverse people, with different backgrounds, ideas, approaches, 
languages, IDEs, interests, itches, sleeping habits, coffee addictions, etc., come together and work on the same problems.&lt;br /&gt;&lt;br /&gt;And lots of disagreement ensues.&lt;br /&gt;&lt;br /&gt;As long as the resulting discussions are driven by technical merit and not personal attacks (i.e., &lt;a href="http://theapacheway.com"&gt;The Apache Way&lt;/a&gt;), and consensus is eventually reached, then the disagreements are very powerful.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-1679741834093539673?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/1679741834093539673/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/03/disagreements-are-healthy.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/1679741834093539673'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/1679741834093539673'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/03/disagreements-are-healthy.html' title='Disagreements are healthy!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7833212505671582479</id><published>2010-02-25T08:31:00.002-05:00</published><updated>2010-02-25T10:13:19.974-05:00</updated><title type='text'>Serenity, courage and wisdom</title><content 
type='html'>&lt;div&gt;I love this quote:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;  &lt;i&gt;Strive for the serenity to accept the things you cannot change;&lt;/div&gt;&lt;div&gt;  courage to change the things you can; and wisdom to know the&lt;/div&gt;&lt;div&gt;  difference.&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It's the opening to the Serenity Prayer (usually attributed to the theologian Reinhold Niebuhr), but I've taken the liberty of replacing &lt;i&gt;God grant&lt;/i&gt; with &lt;i&gt;strive for&lt;/i&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The idiom "pick your battles" means the same thing as the 3rd goal.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This great quote from &lt;a href="http://elise.com/quotes/quotes/shawquotes.htm"&gt;George Bernard Shaw&lt;/a&gt; relates closely to the 2nd goal:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;The reasonable man adapts himself to the conditions that surround him...&lt;/div&gt;&lt;div&gt;  The unreasonable man adapts surrounding conditions to himself...&lt;/div&gt;&lt;div&gt;  Therefore, all progress depends on the unreasonable man&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Think about these goals.  
You'll likely find that each goal is distinct, very important, and somewhat surprisingly often applies in your day to day life.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7833212505671582479?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7833212505671582479/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/02/serenity-courage-and-wisdom.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7833212505671582479'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7833212505671582479'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/02/serenity-courage-and-wisdom.html' title='Serenity, courage and wisdom'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-6165532048966485552</id><published>2010-02-04T10:25:00.001-05:00</published><updated>2010-02-04T10:30:23.357-05:00</updated><title type='text'>Why do people vote against their own interests?</title><content type='html'>&lt;div&gt;I found &lt;a href="http://news.bbc.co.uk/2/hi/americas/8474611.stm"&gt;this article&lt;/a&gt; very interesting.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first thing that struck me is its detached tone -- sort 
of like the curious look a child gives when looking inside a cage at an exotic animal.  Probably this was written by someone who lives in Great Britain, looking over the Atlantic Ocean with mild curiosity at how crazy the US health care situation is.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The second thing that struck me is that the observation is very true. A huge majority of the population in this country would benefit from health care reform, including the public option.  Those who cannot afford health insurance now, those with pre-existing conditions, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Yet, many people who stand to benefit angrily fight reform.  It's just plain weird.  Here are two quotes from the article:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Why are so many American voters enraged by attempts to change a horribly inefficient system that leaves them with premiums they often cannot afford?&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;In Texas, where barely two-thirds of the population have full health insurance and over a fifth of all children have no cover at all, opposition to the legislation is currently running at 87%.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;Finally, the two books referenced by the article certainly look relevant, roughly concluding that the average American doesn't really make decisions based on facts.  (This also means "trial by a jury of your peers" is not a very comforting approach.) 
While I haven't read these books, I have come to the same depressing conclusion.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A democracy is only as effective as its population is at making rational decisions.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-6165532048966485552?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/6165532048966485552/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/02/why-do-people-vote-against-their-own.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6165532048966485552'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/6165532048966485552'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/02/why-do-people-vote-against-their-own.html' title='Why do people vote against their own interests?'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-8173603675981763967</id><published>2010-02-04T08:17:00.001-05:00</published><updated>2010-02-04T08:20:19.288-05:00</updated><title type='text'>Why not create a public bank?</title><content type='html'>&lt;div&gt;Why don't we have a public bank already?  
I mean a bank run by the federal government that competes with private banks, keeping them honest.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Let's recap what's happened in the past few years.  First, the financial world collapsed (the risky sub-prime loans, CDOs, etc.). So, the federal gov't was forced to loan insane amounts of taxpayers' (and our children's future) money to the banks, just to barely keep them afloat.  By and large, that worked: the banks have survived and recovered.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Yet, today they still pay out insane bonuses to their top employees, still fly around in private jets, throw lavish parties, travel to exotic places, etc.  They are not extending the loans to small businesses that are required for our economy to really recover.  And they are now spending lots of money, fighting the legislation that would regulate things to prevent a future collapse from happening again.  In short, they have not changed.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Something has gone terribly wrong!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;See, these banks don't contribute directly to economic progress.  They don't build houses, invent new products, grow food, create more fuel efficient cars, etc.  Really they are just the "lubrication" to enable all the real progress in our economy.  They are not supposed to make tons of money, yet they do.  And they most certainly shouldn't be given the power to hold our economy hostage, as they are today.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So why not create a public bank, run by the federal government, that would extend legitimate loans with reasonable terms, today?  This would put a strong competitive pressure on the existing banks to lend, as the public bank would otherwise take customers away.  
It would be a much more direct way to stimulate the economy, than the "give lots of money to the banks and hope they loan it out instead of paying themselves fat bonuses" approach that is clearly not working very well today.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It could be a temporary creation, only around until the private banks start behaving well again.  Or it could remain indefinitely, keeping the banks honest over time.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-8173603675981763967?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/8173603675981763967/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2010/02/why-not-create-public-bank.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8173603675981763967'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8173603675981763967'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2010/02/why-not-create-public-bank.html' title='Why not create a public bank?'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-8871763811178126444</id><published>2009-12-11T06:10:00.002-05:00</published><updated>2009-12-11T06:23:01.311-05:00</updated><title 
type='text'>Harmony is stuck in my head!</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;I often get songs stuck in my head, but I've noticed a curious thing: I'm especially prone to getting songs with a &lt;a href="http://en.wikipedia.org/wiki/Harmony"&gt;harmony&lt;/a&gt; stuck in my head.  It's a puzzle and my brain won't stop until it's teased apart the primary melody and harmony. Literally it just keeps playing over and over, even as I type this.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.youtube.com/watch?v=meT2eqgDjiM"&gt;This is the song stuck in my head right now&lt;/a&gt;.  It's a great rendition of Michael Jackson's Beat It.  I especially love the xylophone -- it's partially &lt;a href="http://en.wikipedia.org/wiki/Syncopation"&gt;syncopated&lt;/a&gt; (plays, sometimes, on the off beat).  The harmony starts at 1:14.  Warning: it's addictive -- your brain, too, may not stop until it's figured out the harmony!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-8871763811178126444?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/8871763811178126444/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/12/harmony-is-stuck-in-my-head.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8871763811178126444'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8871763811178126444'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/12/harmony-is-stuck-in-my-head.html' title='Harmony is stuck in my 
head!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5455444448913627935</id><published>2009-11-09T08:23:00.003-05:00</published><updated>2009-11-09T08:33:13.778-05:00</updated><title type='text'>Direct Democracy</title><content type='html'>I think &lt;a href="http://www.NaturalNews.com/027439_Congress_democracy_America.html"&gt;this is a great idea&lt;/a&gt;, to allow the US population to vote, directly, on whether a bill should become law, instead of the indirect process we now use, trusting our congresspeople to vote on our behalf.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Everyone in the US knows how lobbyists, hired by corporations or other groups with lots of money, sway how our congresspeople vote by simply bribing them.  It's a disgusting situation, yet, somehow we all complacently accept it as normal.  I'm sure our founding fathers had no idea this would happen nor how much technology would advance.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Congress does other things, of course, like holding public hearings on important topics, etc., so I don't think it'll really be as simple as outright abolishing it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Unfortunately it seems very unlikely that the US will ever switch to a direct democracy.  
But I for one would sure love to see it.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5455444448913627935?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5455444448913627935/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/11/direct-democracy.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5455444448913627935'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5455444448913627935'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/11/direct-democracy.html' title='Direct Democracy'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-7977429160730559348</id><published>2009-10-24T07:28:00.003-04:00</published><updated>2009-10-24T07:45:35.608-04:00</updated><title type='text'>Discrimination against kids</title><content type='html'>&lt;div&gt;A few weeks back we went camping with the kids; it was great fun!  I did tons of camping as a kid and I've been eagerly looking forward to our kids being old enough to go.  They finally are.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My wife found a rather luxurious campground: it had indoor pools, jacuzzi, outdoor pools, places to buy food, drinks, firewood, etc.  
It was generally very kid friendly -- lots of activities for kids, play room, etc.  We were quite spoiled and we loved it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, there was a strict rule that kids are not allowed in the jacuzzi or the sauna.  When we pulled out inflatable pool toys, we were politely informed that they, too, are not allowed.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, don't get me wrong: these rules are common these days.  Many places with pools and jacuzzis and saunas will have the same rules. These rules didn't exist when I was a kid, but now they are commonplace and accepted as normal.  My kids just accepted them, in stride, as kids will do.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But I am bothered: this is quite simply discrimination against kids.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I understand why the rules are there.  Thanks to our overly litigious society, if kids get hurt in the pool or overheat in the jacuzzi, the parents sue the campground rather than accept their own negligence. And unfortunately, they often win or settle out of court.  This is not unlike McDonald's being sued for making coffee that's too hot.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At the same time, we also have become a "safety above all" society, for better or worse.  We must wear helmets when we bike, elbow pads and knee pads when we get on our roller blades, seat belts and car seats when we go driving, life preservers if we get anywhere near water.  Kids barely venture out into their front yard, let alone out around the neighborhood, for the [overblown] fear that something bad could happen.  Examples abound.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As a parent I fully appreciate the dangers of water.  In fact water is a terrifying combination: it is at once incredibly fun yet also deadly.  
To a child who can't yet swim, a pool is literally a potential death trap, and it's the parent's job to keep their kids safe.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Thus, the insurance costs go up, likely by quite a bit, if the campground allows such "unsafe" activities, and so they choose to discriminate against kids and save some money.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There was a time when slavery was commonplace, but that certainly didn't make it right.  Just because kid discrimination has now become widely accepted does not make it right.  If you encounter discrimination against your kids when you travel, please write up a review making this clear so future families can plan accordingly.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-7977429160730559348?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/7977429160730559348/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/10/discrimination-against-kids.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7977429160730559348'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/7977429160730559348'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/10/discrimination-against-kids.html' title='Discrimination against kids'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5661570059568108346</id><published>2009-09-30T07:45:00.001-04:00</published><updated>2009-09-30T07:50:37.332-04:00</updated><title type='text'>A better grass?</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;I came across this article about a &lt;a href="http://www.boston.com/business/articles/2009/07/04/wayland_man_has_developed_seeds_that_produce_a_greener_grass/"&gt;newly developed grass&lt;/a&gt; that does not require the normal intense life-support we all have come to assume is "normal".&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You don't have to water it (after the initial seeding), nor apply pesticides nor fertilizer.  And it only requires mowing once per month instead of the typical once per week schedule for life-support grass.  It was designed to simply survive, naturally, in our challenging northeast climate.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I've always felt that such a grass must exist, but that the existing grass seed companies would not be interested in pursuing it.  See, if the grass simply takes care of itself, we all will buy much less grass seed over time.  The lawn care service industry will see much less business, mowing our lawns monthly instead of weekly.  Manufacturers of pesticides and fertilizers and lawn care equipment will see less demand, etc.  
It's quite clearly not in the interest of the lawn care industry to pursue nor allow such innovation.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I sure hope this grass is successful, but the pessimist in me expects that in a few years time, either this company will have been sued out of existence, or the rights to this grass will have been purchased for a princely sum, and then promptly shelved, by one of the big established players in the grass seed industry.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For better or worse, capitalism favors waste in mature markets.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5661570059568108346?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5661570059568108346/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/09/better-grass.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5661570059568108346'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5661570059568108346'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/09/better-grass.html' title='A better grass?'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-2766651501349002167</id><published>2009-09-24T07:57:00.002-04:00</published><updated>2009-09-24T08:03:12.059-04:00</updated><title type='text'></title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Today, on my morning run, I saw a student walking, late for the bus. The bus saw her walking, way down the road and so stopped and waited for probably two minutes or so for her to catch up and get on. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This might seem like only reasonable behavior, on the bus driver's part.  S/he was being nice, right?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As crazy as it sounds, while it was a nice thing to do, I don't think the bus should have stopped.  Here's why.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It sends the message that one student's inability to be on time is allowed to cut into the time the rest of the students get at school. The needs of the one outweigh the needs of the many (thank you &lt;a href="http://www.imdb.com/title/tt0084726/quotes"&gt;Spock&lt;/a&gt;).  It's only two minutes, but if this happens a few times on the route, day in and day out, that adds up to net/net less time at school for all the kids.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The rest of the students, who made the bus on time, probably having rushed through their morning at home to do so, pay the price for those students who can't make the bus on time.  They will conclude that they, too, can be a bit late and the bus will wait.  Why bother rushing to be on time?  
Rather than being taught that they should try hard to make the bus on time, to take responsibility for not making others wait, they are taught the reverse.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Finally, seeing the bigger picture, this teaches kids that the world will stop and wait for them.  Make up for their faults.  Be forgiving. That you need not try very hard for things because the rest of the world will compensate.  You need not take responsibility.  It ties right into the dangerous sense of entitlement that many kids seem to have now.  For better or worse, the world simply is not like that once you grow up.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;She should have simply missed the bus and learned a good lesson.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-2766651501349002167?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/2766651501349002167/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/09/today-on-my-morning-run-i-saw-student.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2766651501349002167'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/2766651501349002167'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/09/today-on-my-morning-run-i-saw-student.html' title=''/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-8012040356229579492</id><published>2009-09-05T17:09:00.003-04:00</published><updated>2009-09-05T20:19:09.953-04:00</updated><title type='text'>Fun questions</title><content type='html'>Here are two fun questions I've [temporarily] stumped my kids on:&lt;div&gt;&lt;ul&gt;&lt;li&gt;How can gravity make something go up?&lt;/li&gt;&lt;li&gt;How can the moon get you wet?&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-8012040356229579492?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/8012040356229579492/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/09/fun-questions.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8012040356229579492'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8012040356229579492'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/09/fun-questions.html' title='Fun questions'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-3006124472903089015</id><published>2009-09-01T07:51:00.002-04:00</published><updated>2009-09-01T08:19:10.131-04:00</updated><title type='text'>Spell correction</title><content type='html'>&lt;div&gt;Spell correction is a challenging feature for search engines. Unfortunately, it's also crucial: mis-spelling is rampant when users run searches.  In part this is because we all can't remember how to spell, and that's no wonder: the number of English words today is &lt;a href="http://www.youtube.com/watch?v=pMcfrLYDm2U"&gt;5X what it was in Shakespeare's time&lt;/a&gt;! But it's also because we are simply in a hurry, or, lazy, and make many typos.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I rely on &lt;a href="http://aspell.net/"&gt;aspell&lt;/a&gt; when I'm using &lt;a href="http://www.gnu.org/software/emacs"&gt;emacs&lt;/a&gt;.  Modern web browsers and word processors check the spelling of all text you enter. Web-side search engines have excellent spell correction; in fact, I no longer bother to correct my typos when entering a search.  I've often wondered whether such "crutches" of our modern world are in fact weakening our minds and perhaps causing our language to further evolve?  For example, I wonder how Microsoft Word's often wrong (in my experience) grammar checker has crimped "modern" writing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My Chemistry teacher in high school refused to allow us to use calculators during our tests, for fear that we would lose our ability to do math with only basic tools (paper, pencil, brain, hands).  My Physics teacher did the opposite, for the reverse fear that the distraction of doing basic math would take precious time and thought away from actually thinking about how to solve the problems.  
Who's right?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Google clearly sets the gold standard for respelling, one that any search engine is now expected to live up to.  If you don't match that high bar, users are automatically disappointed.  And you really don't want to disappoint your users: it's nearly impossible to get them to try out your new application, and they often don't give second chances.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For most approaches to spell correction, the more data you throw at them the better they perform.  If you have lots of queries coming in, you can use that as your sole source.  Google of course has tons of queries to tap into. If you are less fortunate, you can use your index/documents as your source.  Both of these approaches assume most people know how to spell well!  The assumption seems to hold, for now, but I have to wonder, as we all lean on this crutch and become worse at spelling with time, won't this eventually undermine Google's approach?  No worries; Google will adapt.  This is not unlike investing in index funds: that approach only works well if relatively few people do it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Lucene's &lt;a href="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/spell/SpellChecker.html"&gt;basic spellchecker package&lt;/a&gt;, under contrib, which requires you to provide a Dictionary of "known words", allows you to derive these words from your search index.  It has some limitations: it can only do context-free correction (one word at a time, independent of all other words in the query); it doesn't take word frequency in the index into account when deriving the dictionary (so if a typo gets into your index, which can easily happen, you could end up suggesting that typo!); etc.  But it does provide a pluggable distance measure for picking the best candidate.  
It's a good start.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One particularly sneaky feature to get right is spell correction in the context of entitlements; &lt;a href="http://lucene.markmail.org/search/?q=java-user#query:java-user%20order%3Adate-backward+page:1+mid:tsthwag5byzbo6v3+state:results"&gt;my post this morning on Lucene's user list&lt;/a&gt; raises this problem in a real use case (single index to search multiple users' emails).  Entitlements means restricting access for certain users to certain documents.  For example, you could have a large search index containing all documents from your large intranet, but because of security on the intranet, only certain users are allowed to access certain documents.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Lucene makes it easy to implement entitlements during searching, by using static (based solely on what's indexed) or dynamic (based on some "live" external source at search time) filtering.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, properly doing spell correction in the presence of entitlements is dangerous.  If you build a global lexicon based on your index, that lexicon can easily "bleed" entitlements when there are terms that only occur in documents from one entitlement class.  This might be acceptable for context-free spell correction, but if your spell correction has context (can suggest whole phrases at a time) you could easily bleed a very dangerous phrase (eg, "Bob was fired") by accident.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So, you might choose to splinter your spell correction dictionary by user class, but that could result in far too little data per user class.  
I'm not sure how to solve it well; it's a challenging problem.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I hope I haven't mis-spelled any words here!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-3006124472903089015?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/3006124472903089015/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/09/spell-correction.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3006124472903089015'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3006124472903089015'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/09/spell-correction.html' title='Spell correction'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5568579236867864785</id><published>2009-08-28T10:36:00.002-04:00</published><updated>2009-08-28T13:28:02.636-04:00</updated><title type='text'>The best way to learn is to teach</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;They say the best way to learn something is to teach it.  
Well, I say writing a book sure counts as teaching because writing the &lt;a href="http://www.manning.com/hatcher3/"&gt;2nd edition of Lucene in Action&lt;/a&gt; sure has taught me all sorts of juicy details about Lucene, far more than I would have learned on my own.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I can also say that writing a book about an active open-source project is very demanding.  I try to keep the manuscript current, as changes are happening to Lucene, but then more than once I've been burned by keeping it just a little too current, only to see that the community up and changed its mind on something I had already folded into the book's manuscript and source code!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Finally, as Lucene is getting &lt;a href="http://lucene.markmail.org/message/a4fw4icbwdmnkqfg"&gt;very close to releasing 2.9&lt;/a&gt; I'm now scrambling to fix the loooong tail of little things all throughout the book.  Even once I finish that, it's several more months for a deep technical review, Manning's production process, etc.  
I'm looking forward to finishing!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5568579236867864785?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5568579236867864785/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/08/best-way-to-learn-is-to-teach.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5568579236867864785'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5568579236867864785'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/08/best-way-to-learn-is-to-teach.html' title='The best way to learn is to teach'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-9159452948208759524</id><published>2009-08-23T09:05:00.002-04:00</published><updated>2009-08-23T09:09:16.873-04:00</updated><title type='text'>A kid's mind</title><content type='html'>&lt;div&gt;I just love how kids think.  It's so carefree and unrestrained by all the silly "limitations" we adults have learned with time.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's an example: we just finished a vacation with cousins (my kids' cousins).  
They were all in the car yesterday, eagerly discussing Halloween and who's going to wear what costume.  But then they all realized and lamented that in fact they would not be together for Halloween.  So, immediately, my son said "Dad, can we fly to California for Halloween?".&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It just tickled me pink!  See, we live in Boston, MA, so flying to California is easily a 9 hr affair, one way, "door to door".   Not to mention, expensive!   But my son's thinking of course wasn't restrained by such silly things.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As adults the idea would never be allowed to cross our mind. Somewhere, deep in our brains, is a group of neurons that quickly and mercilessly kills off such thoughts before we can even think them. But there's no such limitation, yet, with a kid's mind.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you spend even a small amount of time with any child, you'll see many examples of this unrestrained thinking, and it's delightfully refreshing.  We all should strive not to grow up.  
It's easily the best thing you could do for yourself!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-9159452948208759524?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/9159452948208759524/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/08/kids-mind.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/9159452948208759524'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/9159452948208759524'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/08/kids-mind.html' title='A kid&apos;s mind'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-4442227358562894152</id><published>2009-08-23T08:34:00.004-04:00</published><updated>2009-08-28T07:55:21.928-04:00</updated><title type='text'>On bottled water</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;The bottled water industry is truly silly.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It's an enormous money-maker, a $20B industry in the US alone.  You're buying a product that's hundreds of times more expensive than tap water.  
Yet, often the bottled water simply comes from a municipal source anyway.  Furthermore, it's easily less safe than your tap water.  See, the EPA has tougher regulations for tap water than the FDA has for bottled water. For example, the FDA allows some contamination of &lt;a href="http://en.wikipedia.org/wiki/Escherichia_coli"&gt;E. coli&lt;/a&gt;, and does not require testing for known parasites such as &lt;a href="http://en.wikipedia.org/wiki/Cryptosporidiosis"&gt;Cryptosporidium&lt;/a&gt; or &lt;a href="http://en.wikipedia.org/wiki/Giardia_lamblia"&gt;Giardia lamblia&lt;/a&gt;.  Likely your bottled water does not contain &lt;a href="http://en.wikipedia.org/wiki/Fluoride"&gt;fluoride&lt;/a&gt; either.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Not to mention the insane consumption of oil required to schlep around all this bottled water and then again to discard the empty plastic bottles.  You should of course recycle them, but precious few of us actually do and so they fill up landfills, a "gift" from us to our future generations.  Or perhaps your empty bottles end up in the &lt;a href="http://en.wikipedia.org/wiki/Great_Pacific_Garbage_Patch"&gt;Great Pacific Garbage Patch&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;And as if all of this weren't already reason enough to avoid bottled water, there is the curious problem of the chemicals in plastic, such as &lt;a href="http://en.wikipedia.org/wiki/Bisphenol_A"&gt;Bisphenol A&lt;/a&gt; (BPA), leaching into the water over time.  Previously, it was believed that these chemicals didn't easily leach unless the plastic was hot (this is why you're not supposed to put plastic containers in the microwave or dishwasher).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, &lt;a href="http://focus.hms.harvard.edu/2009/061909/research_briefs.shtml"&gt;this delightful study&lt;/a&gt; showed that simply drinking bottled water increased BPA in urine by 2/3rds.  
Perhaps bottled water should include a clear "bottled on" date so you can at least roughly gauge how much BPA you're about to drink.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Plastic is clearly an incredibly useful material, and I'm sure we'll eventually sort out the problems of the various chemicals that leach from it.  In the meantime, I simply play it safe by avoiding plastic touching our food/drink, when practical.  For example, we only put glass ware in the microwave, and when we need to carry water on-the-go, we always use a &lt;a href="http://www.kleankanteen.com/"&gt;Klean Kanteen&lt;/a&gt;.  In fact we now have many Klean Kanteens: in the car, in the stroller, next to the kid's beds and in our home offices, on the dining room table, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-4442227358562894152?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/4442227358562894152/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/08/on-bottled-water.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4442227358562894152'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4442227358562894152'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/08/on-bottled-water.html' title='On bottled water'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-3083545320836366582</id><published>2009-08-22T07:13:00.004-04:00</published><updated>2009-08-22T07:32:14.911-04:00</updated><title type='text'>Hurricane Bill is coming!</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;I find myself, this quiet Saturday morning, in Falmouth MA, staring down the barrel of &lt;a href="http://www.weather.com/newscenter/hurricanecentral/2009/bill.html"&gt;Hurricane Bill&lt;/a&gt;!  Seriously, it's headed straight for us, having gained strength last night.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Yet, we're planning to go happily to the beach this morning, anyway.  Why play chicken with a hurricane?  Because the computer models at the &lt;a href="http://www.nws.noaa.gov/"&gt;National Weather Service&lt;/a&gt; insist that Bill will take a turn northward, sometime very soon, and not in fact touch Cape Cod at all (though it is projected to make landfall in Nova Scotia Sunday PM).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We place a lot of confidence in our computer models these days, and I sure hope they're right.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I was in Falmouth for &lt;a href="http://www.geocities.com/hurricanene/hurricanebob.htm"&gt;Hurricane Bob&lt;/a&gt; in 1991 and it was stunning.  Have you ever tried to stand up when 100 mph sustained winds are blowing at you?  
It's quite an experience, and the resulting damage was unreal.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-3083545320836366582?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/3083545320836366582/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/08/i-find-myself-this-quiet-saturday.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3083545320836366582'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3083545320836366582'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/08/i-find-myself-this-quiet-saturday.html' title='Hurricane Bill is coming!'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-8000931030490659007</id><published>2009-08-13T19:35:00.003-04:00</published><updated>2009-08-13T19:39:57.220-04:00</updated><title type='text'>Anticipation is half the fun</title><content type='html'>I've found that for most people, but especially kids, the anticipation leading up to something is a sizable part of the fun.  If you have an exciting vacation coming up, take every chance to remind your kids that it's coming, what the plans are, etc.  
If you don't, they've missed out on half the fun!&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'm also convinced that this is why so many people can go to the Disney parks.  The lines are amazingly long, and the actual rides amazingly short.  It's the anticipation of going on a ride that keeps you happy.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-8000931030490659007?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/8000931030490659007/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/08/anticipation-is-half-fun.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8000931030490659007'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8000931030490659007'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/08/anticipation-is-half-fun.html' title='Anticipation is half the fun'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-3757391833734431178</id><published>2009-08-04T12:38:00.002-04:00</published><updated>2009-08-04T13:29:06.840-04:00</updated><title type='text'>Life support grass</title><content type='html'>&lt;div&gt;&lt;div&gt;Seriously, could we possibly have picked a worse plant for our yards than "grass", even if we 
tried?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I mean the stuff is so sickly, it has no prayer of living on its own in our local climate.  It requires massive amounts of life support just to barely eke out an existence.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'll admit, with all the life support, the stuff can be truly beautiful.  But the price we pay to reach that beauty is ridiculous.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We dump hundreds of gallons of water on the stuff, every night, using sprinkler systems now built standard into the ground for new homes. This of course leaches all nutrients from the soil.  Worse, grass is nitrogen-leaching: it removes nitrogen from the soil.  So we must dump on loads of fertilizer to put it back.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Being on such luxurious life support, the grass grows like it's in a jungle, and so we are forced to mow it, at least once per week.  While leaving those clippings in place would make great natural fertilizer (after all, this is where all the fertilizer went!), it's not pretty so we truck the clippings away and dump them somewhere else.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course, lots of other plants thrive, too, even better than grass. We like to call them "weeds", since they are not grass.  And so we must dump toxins over the yard to kill them.  We dump separate toxins to kill all sorts of bugs.  Sometimes, from too much water, a fungus develops, so we dump something else on to kill that.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;All this stuff we dump on the yard likely endangers our kids, us, and our pets, but somehow we don't seem to care.  It also kills off the worms that'd naturally aerate the soil, and so we must do our own forced mechanical aeration.  
It messes up the pH balance, so we dump yet more stuff on (lime, sulphur) to fix that.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In fall, when the leaves drop on the weak grass, we must quickly rake or blow them off, because the grass dies off quickly if it's left covered by leaves.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;After winter, which the grass barely survives, some of it has died off and turned brown.  This is fully natural, and that dead grass would normally serve as nature's fertilizer, yet we don't like the color, so we dethatch and reseed.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;These practices don't end with grass, of course.  We truck in loads of mulch, the lipstick of modern lawn care, each year.  We spray all sorts of toxins on the trees, the bushes, etc.  Terminix shows up, spraying all sorts of other toxins.  People with blowers show up and blow every last little thing off the asphalt of your driveway.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This whole ritual is now commonplace.  It's assumed, accepted and expected practice.  If you don't subscribe to the life-support grass movement, people think something is wrong with you.  How did we get ourselves into such a mess?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Can't we, instead, find a plant that has no trouble surviving in our natural climate, with zero life support?  Why did we all fall in love with this sickly life support grass, anyway?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For example, crabgrass thrives.  It's very hardy, grows with no additional watering, and takes care of seeding by itself, while grass never succeeds in seeding itself (presumably it's been selected and bred &lt;b&gt;not&lt;/b&gt; to).  Clover is another example, and has the advantage of being nitrogen-fixing (the exact opposite of grass): it extracts nitrogen from the air and puts it back into the soil.  
This is why it's such a dark green even without fertilizer.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Surely we can do better.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-3757391833734431178?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/3757391833734431178/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/08/life-support-grass.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3757391833734431178'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/3757391833734431178'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/08/life-support-grass.html' title='Life support grass'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-433326531793779619</id><published>2009-07-27T06:25:00.002-04:00</published><updated>2009-07-27T06:36:45.717-04:00</updated><title type='text'>ZFS and ECC RAM</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;I started &lt;a href="http://opensolaris.org/jive/thread.jspa?threadID=108609"&gt;this thread&lt;/a&gt;  over on the ZFS discuss list.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My question dug into why the ZFS bigwigs 
always so strongly recommend ECC RAM.  Was it simply for the added security of preventing a few corrupted files (because non-ECC RAM will likely flip a few bits in the lifetime of your computer)?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Or... was there something more spooky going on such that something catastrophic (losing an entire RAID-Z pool) is possible if your RAM has a bit error at a particularly bad time?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is important to know, because when spec'ing out a new server, you need the facts in order to make proper cost/benefit tradeoffs.  This decision should be no different from whether I should get the latest and greatest hard drive (risky, since it has no track record), or a known-good older generation drive that has less capacity and performance but has a good record.  If non-ECC RAM means I risk losing the pool then I'll fork out the extra $$$!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It's great to see that ZFS has such a vibrant community that my simple question received so many answers.  In this day and age the health of the community behind the software you are using is more important than the health of the software itself!  &lt;a href="http://lucene.apache.org"&gt;Lucene&lt;/a&gt; also has a great community, though I'm biased!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My thread also indicates one of the challenges with open-source: sometimes you can't get a "definitive" answer to questions like this.  
Many people chimed in with "opinions", on both sides, but if I tally up the votes, and take into account the number of posts (rough measure of "authority") behind each vote, more people say "a bit error will just corrupt files/directories" than "a bit error can lose the pool".&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The discussion also pointed out &lt;a href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6667683"&gt;this very-important issue&lt;/a&gt;, which is to create a way to roll back ZFS to a prior known good state. It's the closest ZFS will get to providing something like fsck, I think.  Sort of spooky ZFS doesn't already have that.  I hope by the time I need it, if ever, this issue is done!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-433326531793779619?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/433326531793779619/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/07/zfs-and-ecc-ram.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/433326531793779619'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/433326531793779619'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/07/zfs-and-ecc-ram.html' title='ZFS and ECC RAM'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' 
src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-8337544869406314378</id><published>2009-07-18T06:08:00.003-04:00</published><updated>2009-07-18T10:14:36.889-04:00</updated><title type='text'>WDTLER and WDIDLE3</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Western Digital states that the &lt;a href="http://www.wdc.com/en/products/products.asp?driveid=576"&gt;Caviar GP drives&lt;/a&gt; are not recommended for RAID arrays, and that instead you should get their &lt;a href="http://www.wdc.com/en/products/products.asp?driveid=610"&gt;enterprise RE-4 drive&lt;/a&gt;.  But there's a $100 price difference between the two right now!  (&lt;a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16822136344"&gt;$230&lt;/a&gt; vs &lt;a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16822136365"&gt;$330&lt;/a&gt; at Newegg).  So I decided to risk it and build my RAIDZ array with the GP drives.  Check back in a couple of years to see if I have any regrets!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In building the array I discovered two very important fixes I needed to make to the drives, in order to make them behave more like the RE-4 drives.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;First was to enable &lt;a href="http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery"&gt;Time-Limited Error Recovery&lt;/a&gt;. This tells the drive to NOT make ridiculous efforts to recover a sector that it's having trouble reading, and to instead quickly report back an error that the sector could not be read.  See, if the drive takes too long to answer a read request, the RAID level will assume it has gone kaput and boot it from the array.  By enabling TLER, you prevent this from happening, thus letting the RAID level handle the error.  
Use the &lt;a href="http://shifteightgeneration.com/content/wdtler-fix-tler-setting-wd-desktop-hard-drives"&gt;WDTLER&lt;/a&gt; utility to do this.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Second, the GP drives have a feature called Intellipark, which parks the drive's heads (moves them off the platters) so as to reduce air resistance drag on the motor that spins the platter (every little power saving counts!).  You can hear it clearly when it kicks in: it makes a slight clicking sound when parking.  When you need to use the drive again, there's a clear delay and new clicking sound while the disk head unparks.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;While nice in theory, it's unfortunately rather frustrating in practice.  See, modern OSes use write caching to gather up a bunch of writes in RAM, and only actually write to the hard drives in bulk, every 10-30 seconds.  The GP's idle timer is 8 seconds by default (a rather poorly chosen default).  As a result the drive incessantly parks and unparks as random services write a few bytes here and there.  Eventually, too many such cycles (I've read in forums that 300,000 is the spec'd limit) will cause wear &amp;amp; tear and increase the chance of failure.  &lt;a href="http://kerneltrap.org/mailarchive/linux-kernel/2008/4/9/1386304"&gt;This thread&lt;/a&gt; on the Linux Kernel mailing list gives some details.  
While this is a problem even in non-RAID settings, it's exacerbated by RAID because now you have N drives that park/unpark, in sequence.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Fortunately, there's another utility called &lt;a href="http://forums.storagereview.net/index.php?showtopic=27269"&gt;WDIDLE3&lt;/a&gt; that lets you increase the time (to a max of 25.5 seconds, which I don't think is enough), or to disable the timer entirely, which is what I did.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you don't run Windows and thus cannot directly run these EXEs, one simple workaround is to slipstream them into the &lt;a href="http://www.ultimatebootcd.com/"&gt;Ultimate Boot CD&lt;/a&gt; as &lt;a href="http://shifteightgeneration.com/content/wdtler-fix-tler-setting-wd-desktop-hard-drives"&gt;described here&lt;/a&gt;.  Those instructions are for WDTLER specifically, but simply slip in WDIDLE3 at the same time.  Keep the resulting CD accessible since you'll likely need to run it again if you have to replace any drives in your array!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As best I can tell, Western Digital does not officially support these utilities, so use them at your own risk.  
They both worked fine for me, on OpenSolaris, but your mileage may vary!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-8337544869406314378?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/8337544869406314378/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/07/fixing-wd-gps-drives-with-wdtler-and.html#comment-form' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8337544869406314378'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/8337544869406314378'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/07/fixing-wd-gps-drives-with-wdtler-and.html' title='WDTLER and WDIDLE3'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-5799840138585076526</id><published>2009-07-18T05:48:00.004-04:00</published><updated>2009-07-29T06:34:29.810-04:00</updated><title type='text'>Newegg vs Amazon</title><content type='html'>&lt;div&gt;&lt;div&gt;I'm using 6 of the &lt;a href="http://www.wdc.com/en/products/products.asp?driveid=576"&gt;2 TB  Western Digital Caviar GP&lt;/a&gt; drives in my new build, in a &lt;a href="http://blogs.sun.com/bonwick/entry/raid_z"&gt;RAID-Z array&lt;/a&gt;.  
Despite reading horror stories online, eg the many users seeing drives die quickly in the &lt;a href="http://www.newegg.com/Product/ProductReview.aspx?Item=N82E16822136344"&gt;customer reviews at Newegg&lt;/a&gt;, mine are working great despite the sizable stress tests I've been running.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Except: one of my drives keeps reallocating sectors.  I see this in its SMART diagnostics (5 sectors as of 2 weeks ago, 14 reallocated as of yesterday).  This isn't normal (eg, the other 5 drives have 0 reallocated sectors), so I'll be keeping an eye on it and at some point might ask WD for a warranty replacement.  I wonder if there's an accepted "policy" on how many reallocated sectors is too many?  This reminds me of the numerous "how many dead pixels are too many" discussions for new LCD monitors.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course, I don't lose any data because of this; ZFS's RAIDZ simply corrects the error for me.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I bought 3 of the drives from &lt;a href="http://www.newegg.com/"&gt;Newegg&lt;/a&gt; and 3 from &lt;a href="http://www.amazon.com/"&gt;Amazon&lt;/a&gt;.  If I were more patient I would have spread them out over time as well.  In general you should buy your drives across space and time, to minimize the chance of "correlated failures".  If you buy all your drives from the same place, it's likely they were manufactured in the same "batch" which means any manufacturing defect in the production of that batch would make it more likely that you'd lose 2 or more drives at once, thus destroying all your data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Newegg, it turns out, does a poor job shipping hard drives.  They simply wrap them in bubble wrap and tape it up, sometimes packing 2 drives together inside the bubble wrap.  
What they don't realize is, because of rough handling from UPS, those bubbles pop, one by one I imagine, during transit, such that by the time I receive it, there is zero protection (no bubbles left) along at least one edge of the hard drives.   It's rather shocking because Newegg is otherwise excellent.  I've read several posts in the user comments noting exactly what I just said, yet Newegg hasn't improved.  It's a bad sign when a company stops listening to its customers.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In contrast, Amazon (whose price matched Newegg's) packed each drive into its own dedicated foam packing and box.  Fabulous!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[&lt;b&gt;EDIT Jul 28, 2009&lt;/b&gt;: I just received one more drive from Amazon, and they unfortunately have taken a turn for the worse!  They now ship in a similar fashion to Newegg, wrapping the drive in minimal bubble wrap which pops during transit. They also take the wasteful step of "box within a box", which I don't think adds much protection to the drive.  This drive will be my "hot spare", so if/once it gets swapped into the array, I'll try to remember to watch for reallocated sectors and any other problems.  Sigh.]&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The one drive I see failing was in fact one from Newegg (I kept track of the serial numbers); it's entirely possible Newegg's poor shipping and the rough handling from UPS led to this drive's failure.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Losing one drive in a RAID array is quite terrifying because until you get the new drive &lt;a href="http://blogs.sun.com/bonwick/entry/smokin_mirrors"&gt;resilvered&lt;/a&gt;, you're running with no safety margin!  If you lose another drive, you've lost all your data.  RAIDZ is not a replacement for good backups.  
It's best to have a spare drive on hand; you can even install it and notify ZFS to keep the drive as a hot spare, meaning if any drive drops out of the array, ZFS will immediately start the resilvering process to bring the new drive in.  Or you could create a RAIDZ2 array, which has two drives worth of redundancy, but then you've "lost" 2 drives worth of storage!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-5799840138585076526?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/5799840138585076526/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/07/buying-hard-drives-from-newegg-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5799840138585076526'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/5799840138585076526'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/07/buying-hard-drives-from-newegg-and.html' title='Newegg vs Amazon'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-4527684010889842674</id><published>2009-07-17T11:02:00.006-04:00</published><updated>2009-08-15T11:45:59.134-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenSolaris'/><title 
type='text'>OpenSolaris challenges</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Whenever I encounter someone who's overly ecstatic about some new technology or gizmo or something, I quickly say "tell me what's wrong with it".  If they can't think of anything, then I can't trust their opinion.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Nothing is perfect.  There are always tradeoffs to be made.  Only once you are properly informed with the facts, clearly seeing the goods and the bads, minus all the hype, can you finally make a good decision.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you are passionate about something, and you use it day in and day out, then you ought to have a big list of the things that bother you most about it.  Next time you see someone loving their &lt;a href="http://www.apple.com/iphone/"&gt;iPhone&lt;/a&gt;, try asking them what's wrong with it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Unfortunately, hype, "popular opinion", "conventional wisdom", "everybody's doing it", etc. drive so many decisions these days.  Not long ago, when you bought a house, everyone pushed you to choose these newfangled mortgages like ARMs, interest only loans, etc., instead of the boring old-fashioned 30 year fixed rate mortgage.  &lt;a href="http://en.wikipedia.org/wiki/Alan_Greenspan"&gt;Alan Greenspan&lt;/a&gt; was giving speech after speech praising the "innovation" in the financial services industry.  Look where that got us!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I came across this quote recently: "If you find yourself in the majority then it's time to switch sides".  
I've been realizing lately how true that is.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So in this spirit of presenting a balanced picture, here are some of the challenges I've hit with &lt;a href="http://opensolaris.org/"&gt;OpenSolaris&lt;/a&gt;:&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;It took practically an Act of God to switch from a dynamic (DHCP-assigned) IP address to a static one.  I ran the nice GUI administration tool, made the change, and at first all seemed good.  But then on my next reboot, apparently a bunch of services failed to start.  After much futzing, it was only when I uninstalled &lt;a href="http://www.virtualbox.org/"&gt;VirtualBox&lt;/a&gt; that things finally worked (I think VirtualBox's virtual adapter somehow conflicted).  I now have a static IP!&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;There is apparently no SMART support for SATA drives, which is stunning.  These days, as drives become more and more complex, we need access to their diagnostics. I rely on SMART to monitor the health, temperatures, remapped sectors, etc. of my drives.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;The &lt;a href="http://owfs.org/"&gt;1-wire File System&lt;/a&gt; has not been ported to OpenSolaris.  I have a network of 1-wire devices in my house to monitor temperatures, eg, outdoors, in the kid's bedroom, the attic, etc.  I'm still working on this one... there seem to be some problems talking to libusb.  
I may end up simply running a tiny Linux PC (the &lt;a href="http://www.fit-pc2.com/wiki/index.php?title=Main_Page"&gt;Fit PC 2&lt;/a&gt; looks cute) instead, for such random services.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-4527684010889842674?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/4527684010889842674/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/07/whenever-i-encounter-someone-whos.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4527684010889842674'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/4527684010889842674'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/07/whenever-i-encounter-someone-whos.html' title='OpenSolaris challenges'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8623074010562846957.post-9199222953399584631</id><published>2009-07-14T16:13:00.001-04:00</published><updated>2009-07-17T11:23:09.678-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ZFS'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenSolaris'/><title type='text'>Sun's ZFS 
filesystem</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;I've been test driving Sun's (now Oracle's!) &lt;a href="http://opensolaris.org/os/"&gt;OpenSolaris&lt;/a&gt; (2009.06) and &lt;a href="http://opensolaris.org/os/community/zfs/"&gt;ZFS&lt;/a&gt; filesystem as my home filer and general development machine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'm impressed!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;ZFS provides some incredible features. For example, taking a snapshot of your entire filesystem is wicked fast. This gives you a "point in time" copy of all files that you can keep around for as long as you want.  It's very space efficient because only when a file is changed does the snapshot actually consume disk space (preserving the old copy).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;From the snapshot, which is read-only, you can then make a clone that's read-write.  This effectively lets you fork your filesystem, which is amazing.  Sun builds on this by providing "boot environments", which let you clone your world, boot to it, do all kinds of reckless things, and if you don't like the results, switch back to your current safe world again, no harm done.  I used to leave my home filers pretty much untouched once I started using them for fear of screwing something up.  Now with boot environments I can freely experiment away.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I have a great many Lucene source code checkouts, to try out ideas, apply patches, etc., and by using ZFS's cloning I can now create a new checkout and apply a patch in only a few seconds. And it's very space efficient because only the changed files in the new checkout consume disk space.  
Since I'm using an &lt;a href="http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403"&gt;Intel X25 SSD&lt;/a&gt; as my primary storage, space efficiency is important.  The machine uses Intel's Core i7 920 CPU, which has fabulous concurrency and can run the Lucene unit tests 3X faster than my old machine. This all nets out to wonderful productivity gains.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;ZFS also nicely decouples the raw storage device (the "pool") from filesystems that pull from that storage.  For the secondary storage I set up a &lt;a href="http://blogs.sun.com/bonwick/entry/raid_z"&gt;RAID-Z&lt;/a&gt; pool (like RAID-5, but fixes the "write hole" problem) using 6 of the &lt;a href="http://www.amazon.com/Western-Digital-Caviar-Green-WD20EADS/dp/B001RB1TIS"&gt;Western Digital Caviar Green 2TB&lt;/a&gt; drives.  Be sure to use the &lt;a href="http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery"&gt;WDTLER utility&lt;/a&gt; if you use these drives in a RAID array.  This gives me 9TB of usable space to play with; from here I've created many filesystems that all share the pool.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Performance is excellent: copying a 1TB directory on the RAID-Z pool to another directory on the same pool averages 100 MB/sec.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I also just &lt;a href="http://www.theregister.co.uk/2009/07/13/zfs_deduplication/"&gt;read this morning&lt;/a&gt; that ZFS will add de-duping at the block level, thus making it even more space efficient.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;ZFS can provide these features because it has a write-once core: no block is ever overwritten (unless it was already freed).  Lucene has the same core approach: no file is ever overwritten in the index.  Lucene's transactional semantics derive directly from this as well (though Lucene can't "fork" an index... 
maybe someday!).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Bye bye Linux, hello Solaris!  I only hope this innovation continues now that Oracle has acquired Sun.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8623074010562846957-9199222953399584631?l=blog.mikemccandless.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.mikemccandless.com/feeds/9199222953399584631/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://blog.mikemccandless.com/2009/07/ive-been-test-driving-suns-now-oracles.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/9199222953399584631'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8623074010562846957/posts/default/9199222953399584631'/><link rel='alternate' type='text/html' href='http://blog.mikemccandless.com/2009/07/ive-been-test-driving-suns-now-oracles.html' title='Sun&apos;s ZFS filesystem'/><author><name>Mike McCandless</name><uri>http://www.blogger.com/profile/04277432937861334672</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://2.bp.blogspot.com/_4pUbN9gxhUI/TK2P5yUbqyI/AAAAAAAAACE/wQGlMLfJGt0/S220/mike_head.jpg'/></author><thr:total>0</thr:total></entry></feed>
