Thursday, October 20, 2016

Jirasearch 2.0 dog food: using Lucene to find our Jira issues

A few years ago I first built and released Jirasearch as a fun dog-food test case for the thin-wrapper Lucene server, to expose a powerful search UI over our Jira issues.

This is a great showcase of a number of Lucene's important features:

Curiously, spell correction, or even fuzzy infix suggestions, is still missing (pull requests welcome!).

Since the initial release of Jirasearch it has seen substantial usage and interest from users and developers. Building this and keeping it running all this time has been an awesome and humbling exercise for me because I get to experience life as a "production" user of our software. At the same time, we all get a nice search UI for finding issues.

Upgrading from Lucene 4.6.x to 6.x

For the past week or so I had another similarly humbling experience, this time upgrading Jirasearch from the very old Lucene 4.6.x release to the latest 6.x release. Small (yet vital!) things changed, such as the new requirement to use a special index searcher with ToParentBlockJoinQuery, which conflicts with how you must use DrillSideways. I also hit a bug in the infix suggester. Something changed about pure negative boolean queries, but I am still not sure what (I have worked around it for now)!
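As general background on pure negative boolean queries (not necessarily the change described above, nor the workaround used in Jirasearch): in Lucene, a boolean query made up only of MUST_NOT clauses matches no documents, and the usual fix is to add a match-all clause for the negation to subtract from. The toy model below is plain Python over sets of document ids, not Lucene code; `boolean_query`, `ALL_DOCS` and `resolved` are hypothetical names used only for illustration.

```python
# Toy model: each clause is represented by the set of doc ids it matches.
ALL_DOCS = {1, 2, 3, 4, 5}


def boolean_query(must=(), must_not=(), all_docs=ALL_DOCS):
    """Evaluate MUST / MUST_NOT clauses over sets of doc ids.

    With no positive clause there is nothing to select from, so the
    result is empty no matter what the MUST_NOT clauses say.
    """
    if not must:
        return set()
    hits = set(all_docs)
    for clause in must:
        hits &= clause
    for clause in must_not:
        hits -= clause
    return hits


resolved = {2, 4}  # hypothetical: ids of resolved issues

# Pure negative query: only a MUST_NOT clause, so it matches nothing.
pure_negative = boolean_query(must_not=[resolved])

# Adding a match-all positive clause yields "everything except resolved".
with_match_all = boolean_query(must=[ALL_DOCS], must_not=[resolved])
```

Here `pure_negative` is empty, while `with_match_all` contains every document that is not resolved.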

I had already upgraded Lucene server to dimensional points, so I got that "for free" for the existing numeric fields in Jirasearch.

New Jirasearch features

Besides "merely" upgrading from Lucene 4.6.x to 6.x, and switching all numeric fields to the new dimensional points, I also added some compelling user-visible improvements (thank you to Alexandre Rafalovitch for suggesting some of these, thus kick-starting my unexpectedly challenging upgrade-and-improve effort):

  • cutting@apache.org is finally presented as Doug Cutting! Plus, the auto-suggest now works if you type "Doug".
  • The new Updated ago facet dimension lets you drill down to issues that have not been updated for some time.
  • The new Last comment user facet dimension is the user who last commented on an issue.
  • The new Committed by facet dimension lets you drill down to those issues a given developer has committed changes for.
  • The Committed paths hierarchical facet dimension, letting you find issues according to which paths in the source tree were changed for that issue, had been broken since we switched from Subversion to Git, and works again.
  • The Infrastructure project issues are now included as well.
  • The per-comment text processing sees some minor improvements, e.g. expanding a referenced user name to the user's display name, mapping each commitbot comment link directly to the change set and including the branch name, plus a few new synonyms (try pnp!).
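To make the first item concrete, here is a minimal sketch of matching typed text against display names as well as user names. It is plain Python, not the Lucene infix suggester and not the Jirasearch sources; the `DISPLAY_NAMES` table, the `jdoe` entry and the `infix_suggest` function are assumptions for illustration only.

```python
# Hypothetical user name -> display name table; only the first entry
# comes from the post, the second is made up for the example.
DISPLAY_NAMES = {
    "cutting@apache.org": "Doug Cutting",
    "jdoe": "Jane Doe",
}


def infix_suggest(text, display_names=DISPLAY_NAMES):
    """Return (user, display name) pairs where the typed text is a prefix
    of any word in the display name, or of the user name itself."""
    needle = text.strip().lower()
    if not needle:
        return []
    results = []
    for user, name in sorted(display_names.items()):
        tokens = name.lower().split() + [user.lower()]
        if any(token.startswith(needle) for token in tokens):
            results.append((user, name))
    return results


by_first_name = infix_suggest("Doug")
by_last_name = infix_suggest("cut")
```

Both calls return the `cutting@apache.org` entry, because the match runs against every word of the display name rather than only the start of the user name.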

The new facet fields are especially fun: you can now find issues that you perhaps killed, by drilling down on Updated ago > 1 month ago and Last comment user = you (this was the use case suggested by Alexandre).
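The drill-down above can be sketched roughly as follows. This is a plain-Python illustration, not how Lucene's range facets are implemented: the bucket boundaries, field names and sample issues are all hypothetical. The point it tries to show is that an "ago" bucket is computed relative to the current time at search time, so nothing needs re-indexing as issues age.

```python
from datetime import datetime, timedelta

# Hypothetical bucket boundaries; the labels Jirasearch uses may differ.
BUCKETS = [
    ("> 1 day ago", timedelta(days=1)),
    ("> 1 week ago", timedelta(days=7)),
    ("> 1 month ago", timedelta(days=30)),
]


def updated_ago_labels(updated, now):
    """Return every bucket label whose minimum age the issue exceeds."""
    age = now - updated
    return [label for label, min_age in BUCKETS if age > min_age]


def drill_down(issues, now, updated_ago, last_comment_user):
    """Keep issues matching both facet constraints."""
    return [
        issue["key"]
        for issue in issues
        if updated_ago in updated_ago_labels(issue["updated"], now)
        and issue["last_comment_user"] == last_comment_user
    ]


now = datetime(2016, 10, 20)
issues = [  # made-up sample data
    {"key": "DEMO-1", "updated": datetime(2016, 10, 19), "last_comment_user": "you"},
    {"key": "DEMO-2", "updated": datetime(2016, 8, 1), "last_comment_user": "you"},
    {"key": "DEMO-3", "updated": datetime(2016, 7, 1), "last_comment_user": "someone-else"},
]

stale = drill_down(issues, now, "> 1 month ago", "you")
```

Only `DEMO-2` survives both constraints: `DEMO-1` was updated too recently, and `DEMO-3` had a different last commenter.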

Another fun one is to see issues a given developer committed (Committed by) to an unusual part of the source tree (Committed paths), e.g. the issues where I committed changes to Solr for a Lucene Jira issue.
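As a rough illustration of what a hierarchical dimension like Committed paths counts, the sketch below expands each committed path into all of its ancestors and counts every issue once per ancestor. It is plain Python, not Lucene's taxonomy facets, and the function names, issue keys and file paths are made up for the example.

```python
from collections import Counter


def ancestors(path):
    """'a/b/c' -> ['a', 'a/b', 'a/b/c']."""
    parts = [part for part in path.split("/") if part]
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def path_facet_counts(issue_paths):
    """Count issues per path node; an issue counts once per node even
    if several of its committed files live under that node."""
    counts = Counter()
    for paths in issue_paths.values():
        seen = set()
        for path in paths:
            seen.update(ancestors(path))
        counts.update(seen)
    return counts


def children(counts, parent=""):
    """Counts for the direct children of `parent` ('' means the root)."""
    prefix = parent + "/" if parent else ""
    depth = parent.count("/") + 2 if parent else 1
    return {
        path: n
        for path, n in counts.items()
        if path.startswith(prefix) and path.count("/") + 1 == depth
    }


issue_paths = {  # hypothetical issues and paths
    "DEMO-1": ["lucene/core/src/A.java", "lucene/core/src/B.java"],
    "DEMO-2": ["lucene/core/src/A.java", "solr/core/src/C.java"],
}

counts = path_facet_counts(issue_paths)
top_level = children(counts)
under_lucene = children(counts, "lucene")
```

Drilling down one level at a time is then just a call to `children` with the selected node as the parent; an issue that touched both trees, like `DEMO-2` here, shows up under both `lucene` and `solr`.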

Open source Jirasearch

With this update I am also making all the sources behind Jirasearch open-source under the Apache 2 license, in the examples/jirasearch sub-directory of the luceneserver GitHub project.

While Luceneserver itself is entirely Java, the sources for the Jirasearch application are entirely Python. They extract details of all issues from the Apache Jira instance, convert those documents into Lucene server documents, do full and near-real-time indexing, build the suggesters, and serve the search UI.

Please note the Python sources are not particularly pretty. Yet, they are functional, and as always: patches welcome!

It's likely I broke things during this upgrade process; please let me know (add a comment here, or shoot me an email) if so.

10 comments:

  1. Thanks for maintaining Luceneserver & jirasearch, Mike!

  2. Lot of helpful resources in this article. Thanks a lot.

    One clarification:

    "Faceting, with flat, hierarchical, and dynamic numeric range fields."

    Where did you use hierarchical facet in this page? Committed paths? Do you use taxonomy index?

    If yes, Is it replaceable with flat facets (using sortedsetdocvaluefacetfield) by applying "path traversed terms" in filter?

  3. Hi Kumaran,

    Yes, the Committed Paths is the only hierarchical facet field here, using Lucene's taxonomy facets.

    I'm not sure if you could emulate the hierarchy on top of SSDVFacets, but I agree it would be wonderful if we could improve SSDVFacets to support a hierarchy: patches welcome! Maybe open an issue so we could discuss options?

  4. Hi Michael,
    I am trying to build Lucene Server. The process stops at:
    init: cloning lucene branch_6x to ./lucene6x...
    Where can I report this issue?
    Thanks!

    Replies
    1. Hi Leonardo,

      Which Lucene Server sources are you using?

      Mike

    2. Hi Mike,
      I already made a pull request in GitHub's mikemccand/luceneserver.
      Please take a look!

    3. Aha, great, I just merged it! Thank you.

    4. You are welcome!

  5. Hi Mike
    How did you combine relevance calculated by Lucene similarity and recency?
    How much weightage you gave for Lucene relevance score and recency?
