That application has become a powerful showcase of a number of modern Lucene features such as drill sideways and dynamic range faceting, a new suggester based on infix matches, postings highlighter, block-join queries so you can jump to a specific issue comment that matched your search, near-real-time indexing and searching, etc. Whenever new users ask me about Lucene's capabilities, I point them to this application so they can see for themselves.
Recently, I've made some further progress so I want to give an update.
The source code for the simple Netty-based Lucene server is now available on this subversion branch (see LUCENE-5376 for details). I've been gradually adding coverage for additional Lucene modules, including facets, suggesters, analysis, queryparsers, highlighting, grouping, joins and expressions. And of course normal indexing and searching! Much remains to be done (there are plenty of nocommits), and the goal here is not to build a feature rich search server but rather to demonstrate how to use Lucene's current modules in a server context with minimal "thin server" additional source code.
Separately, to test this new Lucene-based server, and to complete the "dog food," I built a simple Jira search application plugin, to help us find Jira issues, here. This application has various Python tools to extract and index Jira issues using Jira's REST API, plus a user-interface layer, running as a Python WSGI app, that sends requests to the server and renders responses back to the user. The goal of this Jira search application is to make it simple to point it at any Jira instance / project and enable full searching over all issues.
I just pushed some further changes to the production site:
- I upgraded the Jira search application to the current server
branch (previously it was running on my private fork).
- I switched all analysis components to Lucene's analysis factories;
these factories use Java's SPI (Service Provider Interface) so that
the server has access to any char filters, tokenizers and token
filters in the classpath.
This is very helpful when building a server because it means you
don't need any special code to handle the great many analysis
components that Lucene provides these days. Everything simply
passes through the factories (which know how to parse their own
arguments).
- I've added the Tika project, so you can now find Tika issues as
well. This was very simple to add, and seems to be working!
- I inserted WordDelimiterFilter so that CamelCaseTokens are split.
For example, try searching on infix and note the highlights. As
Robert Muir reminded me, WordDelimiterFilter corrupts offsets, which
will mess up highlighting in some cases, so I'm going to try to set
up ICUTokenizer, which I'm already using, to do this splitting instead.
- I switched to Lucene's new expressions module to do blended relevance +
recency sort by default when you do a text search, which is helpful
because most of the time we are looking for recently touched issues.
Previously I used a custom FieldComparator to achieve the same
functionality, but expressions is more compact and powerful and lets
me remove that custom FieldComparator.
- I switched to near-real-time building of the suggestions, using
AnalyzingInfixSuggester. Previously I was fully rebuilding the
suggester every five minutes, so this saves a lot of CPU since now I
just add new Jira issues as they come, and refresh the suggester. It
also means a much shorter delay from when an issue is added to when
it can be suggested. See LUCENE-5477 for details.
- I now commit once per day. Previously I never committed, and simply relied on near-real-time searching. That worked just fine, except that whenever I needed to bring the server down (e.g. to push new changes out), it required full reindexing, which was very fast but a poor user experience for those users who happened to do a search while it was happening. Now, when I bounce the server, it comes back to the last commit and then the near-real-time indexing quickly catches up on any issues changed since that last commit.
- Various small issues, such as proper handling when a Jira issue
is renamed (the Jira REST API does not make it so easy to discover
this!); better production push automation; upgraded to a newer
version of the Bootstrap UI library.
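
To make the WordDelimiterFilter item above more concrete, here is a minimal plain-Java sketch of CamelCase splitting that keeps each sub-token's character offsets into the original text, which is what a highlighter needs. This is not Lucene code and not how WordDelimiterFilter or ICUTokenizer are implemented; the class `CamelCaseSplit`, the `Part` type, and the `split` and `highlight` helpers are hypothetical names invented purely for illustration, and the only rule applied is "split at a lowercase-to-uppercase transition."

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class CamelCaseSplit {

    /** A sub-token plus its [start, end) character offsets in the original token. */
    static final class Part {
        final String text;
        final int start;
        final int end;

        Part(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /** Splits at every lowercase-to-uppercase transition, recording offsets. */
    static List<Part> split(String token) {
        List<Part> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            boolean boundary = Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(token.charAt(i));
            if (boundary) {
                parts.add(new Part(token.substring(start, i), start, i));
                start = i;
            }
        }
        if (start < token.length()) {
            parts.add(new Part(token.substring(start), start, token.length()));
        }
        return parts;
    }

    /** Wraps the first sub-token equal to the query (ignoring case) in <b> tags. */
    static String highlight(String token, String query) {
        for (Part p : split(token)) {
            if (p.text.equalsIgnoreCase(query)) {
                // The offsets point into the original string, so the highlight
                // lands on exactly the characters that matched.
                return token.substring(0, p.start)
                        + "<b>" + token.substring(p.start, p.end) + "</b>"
                        + token.substring(p.end);
            }
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(split("AnalyzingInfixSuggester"));
        System.out.println(highlight("AnalyzingInfixSuggester", "infix"));
    }
}
```

If the recorded offsets were wrong (the problem Robert pointed out), the `<b>` tags in `highlight` would wrap the wrong characters even though the search itself matched correctly.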
Please send me any feedback / problems when you're searching for issues!
Hi sir,
I have a client having 90 lakh of records in 20 cores. Is it possible to search across 20 cores at a time by using IndexSearcher? Or are there any efficient way of doing this??? Any help is appreciated...
Hi balaji,
You should probably use ElasticSearch or Solr? They handle this scale out for you.