Wednesday, March 5, 2014

Using Lucene's search server to search Jira issues

You may remember my first blog post describing how the Lucene developers eat our own dog food by using a Lucene search application to find our Jira issues.

That application has become a powerful showcase of a number of modern Lucene features such as drill sideways and dynamic range faceting, a new suggester based on infix matches, postings highlighter, block-join queries so you can jump to a specific issue comment that matched your search, near-real-time indexing and searching, etc. Whenever new users ask me about Lucene's capabilities, I point them to this application so they can see for themselves.

Recently, I've made some further progress so I want to give an update.

The source code for the simple Netty-based Lucene server is now available on this subversion branch (see LUCENE-5376 for details). I've been gradually adding coverage for additional Lucene modules, including facets, suggesters, analysis, queryparsers, highlighting, grouping, joins and expressions. And of course normal indexing and searching! Much remains to be done (there are plenty of nocommits), and the goal here is not to build a feature rich search server but rather to demonstrate how to use Lucene's current modules in a server context with minimal "thin server" additional source code.

Separately, to test this new Lucene based server, and to complete the "dog food," I built a simple Jira search application plugin, to help us find Jira issues, here. This application has various Python tools to extract and index Jira issues using Jira's REST API and a user-interface layer running as a Python WSGI app, to send requests to the server and render responses back to the user. The goal of this Jira search application is to make it simple to point it at any Jira instance / project and enable full searching over all issues.

I just pushed some further changes to the production site:
  • I upgraded the Jira search application to the current server branch (previously it was running on my private fork).

  • I switched all analysis components to Lucene's analysis factories; these factories use Java's SPI (Service Provider Interface) so that the server has access to any char filters, tokenizers and token filters in the classpath. This is very helpful when building a server because it means you don't need any special code to handle the great many analysis components that Lucene provides these days. Everything simply passes through the factories (which know how to parse their own arguments).

  • I've added the Tika project, so you can now find Tika issues as well. This was very simple to add, and seems be working!

  • I inserted WordDelimiterFilter so that CamelCaseTokens are split. For example, try searching on infix and note the highlights. As Rober Muir reminded me, WordDelimiterFilter corrupts offsets, which will mess up highlighting in some cases, so I'm going to try to set up ICUTokenizer, which I'm already using, to do this splitting instead.

  • I switched to Lucene's new expressions module to do blended relevance + recency sort by default when you do a text search, which is helpful because most of the time we are looking for recently touched issues. Previously I used a custom FieldComparator to achieve the same functionality, but expressions is more compact and powerful and lets me remove that custom FieldComparator.

  • I switched to near-real-time building of the suggestions, using AnalyzingInfixSuggester. Previously I was fully rebuilding the suggester every five minutes, so this saves a lot of CPU since now I just add new Jira issues as they come, and refresh the suggester. It also means a much shorter delay from when an index is added to when it can be suggested. See LUCENE-5477 for details.

  • I now commit once per day. Previously I never committed, and simply relied on near-real-time searching. This works just fine, except when I need to bring the server down (e.g. to push new changes out), it required full reindexing, which was very fast but a poor user experience for those users who happened to do a search while it was happening. Now, when I bounce the server it comes back to the last commit and then the near-real-time indexing quickly catches up on any changed issues since that last commit.

  • Various small issues, such as proper handling when a Jira issue is renamed (the Jira REST API does not make it so easy to discover this!); better production push automation; upgraded to a newer version of bootstrap UI library.

There are still plenty of improvements to make to this Jira search application. For fields with many possible drill-down values, I'd like to have a simple suggester so the user can quickly drill down. I'd like to fix the suggester to filter suggestions according to the project. For example, if you've drilled down into Tika issues, then when you type a new search you should see only Tika issues suggested. For that we need to make AnalzyingInfixSuggester context aware. I'd also like a more compact UI for all of the facet fields; maybe I need to hide the less commonly used facet fields under a "More"...

Please send me any feedback / problems when you're searching for issues!

2 comments:

  1. Hi sir,

    I have a client having 90 lakh of records in 20 cores. Is it possible to search across 20 cores at a time by using IndexSearcher? Or are there any efficient way of doing this??? Any help is appreciated...

    ReplyDelete
    Replies
    1. Hi balaji,

      You should probably use ElasticSearch or Solr? They handle this scale out for you.

      Delete