Monday, May 13, 2013

Eating dog food with Lucene


Eating your own dog food is important in all walks of life: if you are a chef you should taste your own food; if you are a doctor you should treat yourself when you are sick; if you build houses for a living you should live in a house you built; if you are a parent then try living by the rules that you set for your kids (most parents would fail miserably at this!); and if you build software you should constantly use your own software.

So, for the past few weeks I've been doing exactly that: building a simple Lucene search application, searching all Lucene and Solr Jira issues, and using it instead of Jira's search whenever I need to go find an issue.

It's currently running at jirasearch.mikemccandless.com and it's still quite rough (feedback welcome!).

It's a good showcase of a number of Lucene features:

  • Highlighting using the new PostingsHighlighter; for example, try searching for fuzzy query.

  • Autosuggest, using the not-yet-committed AnalyzingInfixSuggester (LUCENE-4845).

  • Sorting by various fields, including a blended recency and relevance sort.

  • A few synonym examples, for example try searching for oome.

  • Near-real-time searching, and controlled searcher versions: the server uses NRTManager, SearcherLifetimeManager and SearcherManager.

  • ToParentBlockJoinQuery: each issue is indexed as a parent document, and then each comment on the issue is indexed as a separate child document. This allows the server to know which specific comment, along with its metadata, was a match for the query, and if you click on that comment (in the highlighted results) it will take you to that comment in Jira. This is very helpful for mega-issues!

  • Okapi BM25 for ranking.

The drill-downs on the left also show a number of features from Lucene's facet module:
  • Drill sideways for all fields, so that the field does not disappear when you drill down on it.

  • Dynamic range faceting: the Updated drill-down is computed dynamically, e.g. all issues updated in the past week.

  • Hierarchical fields, which are simple since the Lucene facet module supports hierarchy natively. Only the Component dimension is hierarchical, e.g. look at the Component drill down for all Lucene core issues.

  • Multi-select faceting (hold down the shift key when clicking on a value), e.g. all improvements and new features.

  • Multi-valued fields (e.g. User, Fix version, Label).



This is really eating two different dog foods: first, as a developer I see what one must go through to build a search server on top of Lucene's APIs, but second, as an end user, I experience the resulting search user interface whenever I need to find a Lucene or Solr issue. It's like having to eat both wet and dry dog food at once, and both kinds of dog food have uncovered numerous issues!

The issues ranged from outright bugs such as PostingsHighlighter picking the worst snippets instead of the best (LUCENE-4826), to missing features like dynamic numeric range facets (LUCENE-4965), to issues that make consuming Lucene's APIs awkward, especially when mixing different features, such as the difficulty of mixing non-range and range facets with DrillSideways (LUCENE-4980) and the difficulty of using NRTManager with both a taxonomy index and a search index (LUCENE-4967), or finally just inefficient, such as the inability to customize how PostingsHighlighter loads its field values (LUCENE-4846).

The process is far from done! There are still a number of issues that need fixing. For example, it's not easy to mix ToParentBlockJoinQuery and grouping, which is frustrating because fields like who reported an issue, severity, issue status, component would all be natural group-by fields. Some issues, such as the inability of PostingsFormatter to render directly to a JSONObject (LUCENE-4896) are still open because they are challenging to fix cleanly. Others, such as the infix suggester (LUCENE-4845) are in limbo because of disagreements on the best approach, while still others, like BlendedComparator used to sort by mixed relevance and recency, I just haven't pushed back into Lucene yet.

There are plenty of ways to provoke an error from the server; here are two fun examples: try fielded search such as summary:python, or a multi-select drilldown on the Updated field.

Much work remains and I'll keep on eating both wet and dry dog food: it's a very productive way to find problems in Lucene!

38 comments:

  1. The synonym example is using an (internal?) IP address.

    Are you using PyLucene to connect with a Lucene instance?

    ReplyDelete
  2. Hi Ivan,

    Woops: I fixed the synonym example link ... thanks.

    I'm made a simple HTTP server (using Netty) to wrap Lucene APIs as JSON, and I access Lucene through that currently.

    ReplyDelete
  3. It would be great if the source code of this app were available. I think lots of Lucene users would benefit from having this code to dig in when looking for examples of how to accomplish certain search features using Lucene.

    ReplyDelete
    Replies
    1. Hi Sudarshan,

      I plan make the sources available! I agree this is useful to see how to use Lucene's features ...

      Delete
    2. +1 for this. It would be a fantastic example for the community.

      Delete
    3. Any news on making sources available?

      Delete
    4. I'm curious too. Is this going to be made available?

      Delete
  4. Hey Mike, this is very cool and super fast! I'm curious about the suggester, how/when do you rebuild it? Do you do it every X hours or maybe something more sophisticated?

    ReplyDelete
    Replies
    1. Hi Adrien,

      I rebuild the suggester every 5 minutes right now ... it would be nice to have a more NRT-friendly approach :)

      Actually, because I'm using AnalyzingInfixSuggester (which is basically just an index and lookups are just searches), it would be possible to do this ... but I haven't explored it yet.

      Delete
    2. Asking in 2014 :) .. Any alternatives to AnalyzingInfixSuggester that is NRT friendly? I see the suggester is rebuilding index and adding it into original index. Not sure I follow Lucene 4.x code.

      Delete
    3. I don't think any of Lucene's current suggesters are NRT friendly.

      But, AnalyzingInfixSuggester could fairly easily be made NRT friendly: it's just a Lucene index under-the-hood. But nobody has made a patch/issue for this yet ...

      Delete
  5. Very, very nice. I'm a big supporter of the "eat your own dog food" mindset. It's the best approach to uncovering issues or finding potential new features.

    Presumably the source in patches or related commit logs could provide a way of tying the Java package name structures in as a hierarchical facet?

    ReplyDelete
    Replies
    1. Hi Mark,

      That's a VERY cool idea, linking in package name hierarchy from the issues via commit logs. This would allow for wild explorations, like seeing which parts of the code base have had the most bugs over time, etc.

      Delete
    2. A UI model for navigating the package hierarchy facet could be the sort of drill-down pie chart "JDiskReport" offers for showing which directories on your file system are chewing up all the disk space. Each slice of the pie chart can be clicked to take you down a further level in the hierarchy.

      Delete
    3. That sounds great! I'll probably start with "normal" drill down and then try to find a pie-drilldown-widget to switch to.

      Delete
    4. Hi Mark,

      I just pushed a change to do a BASIC integration with svn commit logs: it just adds a normal drill-down, Committed Paths. It's a start :) And sometimes there are way too many entries under one path ...

      Delete
    5. Nice work, Mike.
      I see what you mean about a lot of facet counts. Also, sometimes the search term (e.g. "queryparser") can be very loosely connected to a single issue e.g. it only appears fleetingly as part of an attached build log e.g. https://issues.apache.org/jira/browse/LUCENE-3887

      In situations like that, maybe it's best to steer users to the related source code areas with facet *scores* based on something like Jaccard cooefficient rather than absolute counts. This would effectively rank the facets by the mix of related results vs corpus. As an example, "core" generally sees 3 times more action than "contrib" but a search for "highlighter" that yields equals counts for "contrib" and "core" categories should steer users towards "contrib" first due to the statistical skew in the background results.
      Obviously the Jaccard scores might not mean much to display but can help in sorting/sizing the facet suggestions.
      Do you have JSON/REST apis for the backend? Would be nice if the UI was decoupled to allow for alternative front ends.

      Delete
    6. Hi Mark,

      The search server is a basic JSON/REST Netty wrapper around Lucene, and the UI server (separate process) is Python running under Apache using mod_wsgi.

      Jaccard coefficient is a great idea! Hmm but I don't know how to easily compute that w/ the facets module. Maybe as a first step I could aggregate the relevance against each facet label and sort by that (facets module can do this, but I need to expose in the server...).

      Separately I fixed the svn paths so that lucene/contrib/X now maps to lucene/X (I try to "canonicalize" the paths), and I noticed the highlights for "queryparser" were absurdly big so I trimmed them down, and also I was failing to search the issue key itself... I just pushed fixes for these.

      Delete
    7. Aggregating relevance sounds like it could be a useful approach. Perhaps experimenting with averaging rather than summing hit scores per facet category? Score averages would avoid the problem with summing of popular categories that always achieve high scores simply by virtue of the many low-grade matches that are likely to be found in there. Of course averaging *all* results in a very large facet category may serve to adversely dilute scores of the few very good hits that may fall in there.

      It's a messy business trying to adequately summarize "fuzzy" sets :)

      Delete
  6. Very nice and useful too.

    Would be nice to have Author field in the search. I frequently do "my issues that mention X". Putting in my name and issue keyword would be great. Currently, it picks up my name but only in comments.

    ReplyDelete
    Replies
    1. Hi Alexandre,

      You can do this by drilling down on Reporter (all the way near the bottom!), but I agree that's not very convenient.

      I'll try to index reporter/assigned/commenter names as text as well...

      Delete
    2. Hi Alexandre,

      I just pushed a change to index User (= reporter, assigned, commented) as a text field for searching, so if you search by your name in the text box, it should find hits.

      Delete
  7. Hi,

    Awesome Idea. I'm inspired to make an application myself :) Could you point me to a good resource where I could use Netty to accept JSON, parse it to consume the Lucene API's. Till you publish your code I could get started.

    Also I had a couple of questions.
    I wanted to know how did you build weights for the AnalyzingInfixSuggester? Did you use issue priority /open closed?

    You have used ToParentBlockJoinQuery for indexing comments. If instead we used a multivalued field what would be different?


    ReplyDelete
    Replies
    1. Hi Varun,

      I just started w/ Netty's javadocs and Google searches to build the Lucene server wrapper ... I'll try to post this soon.

      For the suggestions I weight entirely by recency.

      ToParentBJQ in general would let you do query constraints at the comment level, eg find all issues where you said xyz in a comment, but I'm not exposing that now. It also lets me know all comment metadata for the comments that matched the search, so I can link to that comment in the highlights, vs multi-valued field where the postings are all concatenated.

      Delete
  8. The speed is amazing! Quite interesting if we can use this to search other apache issues :)

    ReplyDelete
  9. Very cool! Waiting for a day this would be open sourced. If not the UI, at least the netty wrapper around lucene.

    -Ramesh

    ReplyDelete
  10. I suggest you sell it to http://www.thoughtworks-studios.com/mingle-agile-project-management it will be really good business.

    ReplyDelete
  11. Hi-- This is great stuff. I've made an adapted version that glues the suggestion onto the end of the search passed in (after truncating whatever suffix of the search is a prefix of the suggestion).

    I was wondering about the way you open your readers in the build() method.

    In particular, starting at line 257 (of the code from lucene 4.4.0), we have this:

    r = new SlowCompositeReaderWrapper(DirectoryReader.open(w, false));
    //long t1 = System.nanoTime();
    w.rollback();

    final int maxDoc = r.maxDoc();

    In my naivete, it would have tried something along the lines of

    w.close();
    r = new SlowCompositeReaderWrapper(DirectoryReader.open(dirTmp));

    Where dirTmp is the directory defined above on line 189.

    Down below, after you create the second indexwriter that you fill from this sorted reader, you do a similar thing when you open the searcher.

    Is this opening a reader from a writer then closing the writer something I should be doing? I tried going through the code a bit, but I still haven't gotten comfortable enough in the innards to get very far very quickly.


    ReplyDelete
    Replies
    1. Hi Michael,

      Pulling a reader directly the from the writer (this is called a "near real time" reader) is always an option, if you want to avoid the cost of close (= commit) in the latency of opening the reader. In AnalyzingInfixSuggester, for the first writer, I want to avoid commit entirely because this index is just temporary, so after I open the NRT reader I then rollback the writer. In the second case I could have just closed the writer and opened a reader from the Directory instead ... there's no real benefit of using an NRT reader there.

      Delete
  12. awesome, is there a way to have a look to the source code?
    learning by studying the source code would help me a lot.
    is that on github or something

    ReplyDelete
    Replies
    1. Hi, I've posted the source code behind the server onto this Lucene issue:

      https://issues.apache.org/jira/browse/LUCENE-5376

      The goal is to make these sources available as a "demo" server for Lucene.

      Delete
    2. wau, that is great. thank you : )

      Delete
  13. Mike, how did you retrieve key value pair kind of results in the auto complete response. Do you add more than one field to the index when you build the suggestion dictionary?
    http://jirasearch.mikemccandless.com/suggest.py?term=6025&source=titles_infix&index=jira&contexts=

    [{"volLink": "http://issues.apache.org/jira/browse/LUCENE-6025", "label": "LUCENE-6025: Add BitSet.prevSetBit"}, {"volLink": "http://issues.apache.org/jira/browse/SOLR-6025", "label": "SOLR-6025: StreamingUpdateSolrServer is mentioned in various schema.xml files"}]

    ReplyDelete
  14. Another example is
    http://jirasearch.mikemccandless.com/suggest.py?term=candles&source=titles_infix&index=jira&contexts=

    [{"user": "Michael McCandless", "label": "user: \"Michael McCandless\""}]

    ReplyDelete
    Replies
    1. Hi Shyamsunder,

      I use AnalyzingInfixSuggester here, and I uses both its "context" feature, to only show suggestions for the project you've drilled into, and its "payload" feature, to hold the metadata behind each suggestion (issue URL for an issue, username for a user).

      In the search server behind the scenes, the suggester is built separately from what fields were indexed into documents (unlike Solr, Elasticsearch, I think), so I'm indexing different sources (issues, usernames, project names) into a single suggester, and on the front end I have logic to look at each suggestion and render it differently.

      Delete