Changing Bits: Jirasearch 2.0 dog food: using Lucene to find our Jira issues

A few years ago I first built and released Jirasearch as a fun dog-food test case for the thin-wrapper Lucene server, to expose a powerful search UI over our Jira issues.

This is a great showcase of a number of Lucene's important features:

Using block join queries to model parent (the original Jira issue) and children (each comment) documents. This basic relational structure is also common in e-commerce applications, where you have a product (e.g. a specific shirt) and then individual SKUs (size/color combinations) under that shirt
Highlighting with PostingsHighlighter
Faceting, with flat, hierarchical, and dynamic numeric range fields. Remember you can pick multiple facet values (multi-select) with shift+click!
DrillSideways facet counts, so you don't lose facet counts of other labels just because you drilled down to one of them
AnalyzingInfixSuggester for auto-suggest, including near-real-time updates. Suggestions are project specific: if you have drilled down to one or more projects, then the suggestions will only come from those projects, thanks to AnalyzingInfixSuggester now supporting contexts
Near real time indexing and searching
WordDelimiterFilter so camel case tokens are split (try searching for infix)
Synonyms
Using expressions to dynamically compute each hit's sort score as a blend of recency and relevance
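
As a rough illustration of the last item, here is a minimal plain-Python sketch of blending relevance with recency into one sort score. The formula, the `recency_weight` and `half_life_days` parameters, and the sample hits are all assumptions made up for illustration; this is not the expression Jirasearch actually uses, and it does not call Lucene's expressions module.

```python
# Hypothetical blend of relevance and recency; NOT the actual Jirasearch expression.
SECONDS_PER_DAY = 86400.0

def blended_score(relevance, age_seconds, recency_weight=2.0, half_life_days=30.0):
    """Relevance plus a recency boost that halves every half_life_days."""
    age_days = max(age_seconds, 0.0) / SECONDS_PER_DAY
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 when brand new, 0.5 after one half-life
    return relevance + recency_weight * recency

# Made-up hits: an old issue with a higher text score and a fresh one with a lower score.
hits = [
    {"key": "OLD-1", "relevance": 3.0, "age_seconds": 365 * SECONDS_PER_DAY},
    {"key": "NEW-2", "relevance": 2.0, "age_seconds": 1 * SECONDS_PER_DAY},
]

ranked = sorted(
    hits,
    key=lambda hit: blended_score(hit["relevance"], hit["age_seconds"]),
    reverse=True,
)
print([hit["key"] for hit in ranked])
```

With these assumed weights the recently updated issue sorts ahead of the older one even though its raw relevance is lower. Raising `recency_weight` or shortening `half_life_days` pushes the ordering further toward recency.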

Curiously, spell correction, or even fuzzy infix suggestions, is still missing (pull requests welcome!).

Since the initial release of Jirasearch it has seen substantial usage and interest from users and developers. Building this and keeping it running all this time has been an awesome and humbling exercise for me because I get to experience life as a "production" user of our software. At the same time, we all get a nice search UI for finding issues.

Upgrading from Lucene 4.6.x to 6.x

For the past week or so I had another similarly humbling experience, this time upgrading Jirasearch from the very old Lucene 4.6.x release to the latest 6.x release. Small (yet vital!) things changed, such as the new requirement to use a special index searcher with ToParentBlockJoinQuery, which conflicts with how you must use DrillSideways. I hit this bug in the infix suggester. Something changed about pure negative boolean queries, but I am still not sure what (I have worked around it for now)!

I had already upgraded Lucene server to dimensional points, so I got that "for free" for the existing numeric fields in Jirasearch.

New Jirasearch features

Besides "merely" upgrading from Lucene 4.6.x to 6.x, and switching all numeric fields to the new dimensional points, I also added some compelling user-visible improvements (thank you to Alexandre Rafalovitch for suggesting some of these, thus kick-starting my unexpectedly challenging upgrade-and-improve effort):

cutting@apache.org is finally presented as Doug Cutting! Plus, the auto-suggest now works if you type "Doug".
The new Updated ago facet dimension lets you drill down to issues that have not been updated for some time.
The new Last comment user facet dimension is the user who last commented on an issue.
The new Committed by facet dimension lets you drill down to those issues a given developer has committed changes for.
The Committed paths hierarchical facet dimension, letting you find issues according to which paths in the source tree were changed for that issue, works again; it had been broken since we switched from Subversion to Git.
The Infrastructure project issues are now included as well.
The per-comment text processing has some minor improvements, e.g. expanding a referenced user name to their display name, mapping a commitbot comment's link directly to the change set and including the branch name, plus a few new synonyms (try pnp!)

The new facet fields are especially fun: you can now find issues that you perhaps killed, by drilling down on Updated ago > 1 month ago and Last comment user = you (this was the use case suggested by Alexandre).
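
To make that drill-down concrete, here is a small plain-Python sketch of how an "Updated ago" range facet can be computed at query time from each issue's last-updated timestamp and then combined with a Last comment user filter. The bucket boundaries, field names, and sample issues are assumptions for illustration only; the real implementation uses Lucene's dynamic numeric range faceting rather than this code.

```python
from datetime import datetime, timedelta

# Hypothetical bucket boundaries; the real facet's ranges may differ.
UPDATED_AGO_BUCKETS = [
    ("> 1 day ago", timedelta(days=1)),
    ("> 1 week ago", timedelta(days=7)),
    ("> 1 month ago", timedelta(days=30)),
]

def updated_ago_labels(updated, now):
    """Return every bucket label whose minimum age this issue exceeds."""
    age = now - updated
    return [label for label, min_age in UPDATED_AGO_BUCKETS if age > min_age]

def drill_down(issues, now, updated_ago, last_comment_user):
    """Keep issues matching both the range bucket and the last commenter."""
    return [
        issue for issue in issues
        if updated_ago in updated_ago_labels(issue["updated"], now)
        and issue["last_comment_user"] == last_comment_user
    ]

now = datetime(2016, 10, 20)
issues = [  # made-up sample data
    {"key": "A-1", "updated": datetime(2016, 6, 1), "last_comment_user": "alice"},
    {"key": "A-2", "updated": datetime(2016, 10, 19), "last_comment_user": "alice"},
    {"key": "A-3", "updated": datetime(2016, 1, 5), "last_comment_user": "bob"},
]

stale = drill_down(issues, now, "> 1 month ago", "alice")
print([issue["key"] for issue in stale])
```

Because the buckets are computed relative to `now` when the query runs, nothing has to be re-indexed as issues age from one bucket into the next.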

Another fun one is to see issues a given developer committed (Committed by) to an unusual part of the source tree (Committed paths), e.g. the issues where I committed changes to Solr for a Lucene Jira issue.
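
The combination of a flat facet (Committed by) with a hierarchical one (Committed paths) can be sketched in plain Python as follows. This is only an illustration of the counting idea, where every committed path also counts toward each of its ancestor directories; it does not use Lucene's taxonomy index, and the field names, developer names, and paths are hypothetical.

```python
from collections import Counter

def path_prefixes(path):
    """'lucene/core/src' -> ['lucene', 'lucene/core', 'lucene/core/src']"""
    parts = path.strip("/").split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]

def committed_path_counts(issues, committed_by=None):
    """Count issues per path node, optionally restricted to one committer."""
    counts = Counter()
    for issue in issues:
        if committed_by is not None and committed_by not in issue["committed_by"]:
            continue
        # Count each node once per issue, even if several changed paths share it.
        nodes = set()
        for path in issue["committed_paths"]:
            nodes.update(path_prefixes(path))
        counts.update(nodes)
    return counts

issues = [  # made-up sample data
    {"key": "X-1", "committed_by": ["dev1"], "committed_paths": ["lucene/core/src", "lucene/facet"]},
    {"key": "X-2", "committed_by": ["dev1"], "committed_paths": ["solr/core"]},
    {"key": "X-3", "committed_by": ["dev2"], "committed_paths": ["lucene/core"]},
]

all_counts = committed_path_counts(issues)
dev1_counts = committed_path_counts(issues, committed_by="dev1")
print(dev1_counts["lucene"], dev1_counts["solr"])
```

Drilling down on one developer and then reading the top-level path counts is what surfaces the "unusual" cases, such as a committer who mostly touches one subtree showing a small count under another.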

Open source Jirasearch

With this update I am also making all the sources behind Jirasearch open-source under the Apache 2 license, in the examples/jirasearch sub-directory of the luceneserver GitHub project.

While Luceneserver itself is entirely Java, the sources for the Jirasearch application are entirely Python: they extract details of all issues from the Apache Jira instance, convert those issues into Lucene server documents, run full and near-real-time indexing, build the suggesters, and serve the search UI.

Please note the Python sources are not particularly pretty. Yet, they are functional, and as always: patches welcome!

It's likely I broke things during this upgrade process; please let me know (add a comment here, or shoot me an email) if so.

10 comments:

David Smiley, October 21, 2016 at 8:25 AM
Thanks for maintaining Luceneserver & jirasearch, Mike!
Kumaran R, October 24, 2016 at 3:58 AM
Lot of helpful resources in this article. Thanks a lot.

One clarification:

"Faceting, with flat, hierarchical, and dynamic numeric range fields."

Where did you use hierarchical facet in this page? Committed paths? Do you use taxonomy index?

If yes, Is it replaceable with flat facets (using sortedsetdocvaluefacetfield) by applying "path traversed terms" in filter?
Michael McCandless, October 24, 2016 at 11:58 AM
Hi Kumaran,

Yes, the Committed Paths is the only hierarchical facet field here, using Lucene's taxonomy facets.

I'm not sure if you could emulate the hierarchy on top of SSDVFacets, but I agree it would be wonderful if we could improve SSDVFacets to support a hierarchy: patches welcome! Maybe open an issue so we could discuss options?
Anonymous, August 18, 2017 at 4:57 PM
Hi Michael,
Trying to build Lucene Server. The process stops at:
init: cloning lucene branch_6x to ./lucene6x...
Where can I report this issue?
Thanks!
Anonymous, February 1, 2018 at 1:24 AM
Hi Mike
How did you combine relevance calculated by Lucene similarity and recency?
How much weight did you give to the Lucene relevance score versus recency?

Thursday, October 20, 2016
