Changing Bits: 2014

Monday, November 17, 2014

Apache Lucene™ 5.0.0 is coming!

At long last, after a strong series of 4.x feature releases, most recently 4.10.2, we are finally working towards another major Apache Lucene release!

There are no promises for the exact timing (it's done when it's done!), but we already have a volunteer release manager (thank you Anshum!).

A major release in Lucene means all deprecated APIs (as of 4.10.x) are dropped, support for 3.x indices is removed while the numerous 4.x index formats are still supported for index backwards compatibility, and the 4.10.x branch becomes our bug-fix only release series (no new features, no API changes).

5.0.0 already contains a number of exciting changes, which I describe below, and they are still rolling in with ongoing active development.

Stronger index safety

Many of the 5.0.0 changes are focused on providing stronger protection against index corruption.

All file access now uses Java's NIO.2 APIs, giving us better error handling (e.g., Files.delete returns a meaningful exception) along with atomic rename for safer commits, reducing the risk of hideous "your entire index is gone" bugs like this doozie.

Lucene's replication module, along with distributed servers on top of Lucene such as Elasticsearch or Solr, must copy index files from one place to another. They do this for backup purposes (e.g., snapshot and restore), for migrating or recovering a shard from one node to another or when adding a new replica. Such replicators try to be incremental, so that if the same file name is present, with the same length and checksum, it will not be copied again.

Unfortunately, these layers sometimes have subtle bugs (they are complex!). Thanks to checksums (added in 4.8.0), Lucene already detects if the replicator caused any bit-flips while copying, and this revealed a long standing nasty bug in the compression library Elasticsearch uses.

With 5.0.0 we take this even further and now detect if whole files were copied to the wrong file name, by assigning a unique id to every segment and commit (segments_N file). Each index file now records the segment id in its header, and then these ids are cross-checked when the index is opened.

The new Lucene50Codec also includes further index corruption detection.

Even CorruptIndexException itself is improved! It will now always refer to the file or resource where the corruption was detected, as this is now a required argument to its constructors. When corruption is detected higher up (e.g., a bad field number in the field infos file), the resulting CorruptIndexException will now state whether there was also a checksum mismatch in the file, helping to narrow the possible source of the corruption.

Finally, during merge, IndexWriter now always checks the incoming segments for corruption before merging. This can mean, on upgrading to 5.0.0, that merging may uncover long-standing latent corruption in an older 4.x index.

Reduced heap usage

5.0.0 also includes several changes to reduce heap usage during indexing and searching.

If your index has 1B docs, then caching a single FixedBitSet-based filter in 4.10.2 costs a non-trivial 125 MB of heap! But with 5.0.0, Lucene now supports random-writable and advance-able sparse bitsets (RoaringDocIdSet and SparseFixedBitSet), so the heap required is in proportion to how many bits are set, not how many total documents exist in the index. These bitsets also greatly simplify how MultiTermQuery is rewritten (no more CONSTANT_SCORE_AUTO_REWRITE_METHOD), and they provide faster advance implementations than FixedBitSet's linear scan. Finally, they provide a more accurate cost() implementation, allowing Lucene to make better choices about how to drive the intersection at query time.

Heap usage during IndexWriter merging is also much lower with the new Lucene50Codec, since doc values and norms for the segments being merged are no longer fully loaded into heap for all fields; now they are loaded for the one field currently being merged, and then dropped.

The default norms format now uses sparse encoding when appropriate, so indices that enable norms for many sparse fields will see a large reduction in required heap at search time.

An explain API for heap usage

If you still find Lucene using more heap than you expected, 5.0.0 has a new API to print a tree structure showing a recursive breakdown of which parts are using how much heap. This is analogous to Lucene's explain API, used to understand why a document has a certain relevance score, but applied to heap usage instead.

It produces output like this:

_cz(5.0.0):C8330469: 28MB 
  postings [...]: 5.2MB 
    ... 
    field 'latitude' [...]: 678.5KB 
      term index [FST(nodes=6679, ...)]: 678.3KB

This is a much faster way to see what is using up your heap than trying to stare at a Java heap dump.

Further changes

There is a long tail of additional 5.0.0 changes; here are some of them:

Old experimental postings formats (Sep/Fixed/VariableIntPostingsFormat) have been removed. PulsingPostingsFormat has also been removed, since the default postings format already pulses unique terms.
FieldCache is gone (moved to a dedicated UninvertingReader in the misc module). This means when you intend to sort on a field, you should index that field using doc values, which is much faster and less heap consuming than FieldCache.
Tokenizers and Analyzers no longer require Reader on init.
NormsFormat now gets its own dedicated NormsConsumer/Producer
Simplifications to FieldInfo (Lucene's "low schema"): no more normType (it is always a DocValuesType.NUMERIC), no more isIndexed (just check IndexOptions)
Compound file handling is simpler, and is now under codec control.
SortedSetSortField, used to sort on a multi-valued field, is promoted from sandbox to Lucene's core
PostingsFormat now uses a "pull" API when writing postings, just like doc values. This is powerful because you can do things in your postings format that require making more than one pass through the postings such as iterating over all postings for each term to decide which compression format it should use.
Version is no longer required on init to classes like IndexWriterConfig and analysis components.

The changes I've described here are just a snapshot of what we have lined up today for a 5.0.0 release. 5.0.0 is still under active development (patches welcome!) so this list will change by the time the actual release is done.

Saturday, August 30, 2014

Scoring tennis using finite-state automata

For some reason having to do with the medieval French, the scoring system for tennis is very strange.

In actuality, the game is easy to explain: to win, you must score at least 4 points and win by at least 2. Yet in practice, you are supposed to use strange labels like "love" (0 points), "15" (1 point), "30" (2 points), "40" (3 points), "deuce" (3 or more points each, and the players are tied), "all" (players are tied) instead of simply tracking points as numbers, as other sports do.

This is of course wildly confusing to newcomers. Fortunately, the convoluted logic is easy to express as a finite-state automaton (FSA):

The game begins in the left-most (unlabeled) state, and then each time either player 1 (red) or player 2 (blue) scores, you advance to the corresponding state to know how to say the score properly in tennis-speak. In each state, player 1's score is first followed by player 2's; for example "40 30" means player 1 has scored 3 points and player 2 has scored 2 and "15 all" means both players have scored once. "adv 2" means player 2 is ahead by 1 point and will win if s/he scores again.

There are only 20 states, and there are cycles which means a tennis game can in fact go on indefinitely, if the players pass back and forth through the "deuce" (translation: game is tied) state.

This FSA is correct, and if you watch a Wimbledon match, for example, you'll see the game advance through precisely these states.

Minimization

Yet for an FSA, merely being correct is not good enough!

It should also strive to be minimal, and surprisingly this FSA is not: if you build this Automaton in Lucene and minimize it, you'll discover that there are some wasted states! This means 20 states is overkill when deciding who won the game.

Specifically, there is no difference between the "30 all" and "deuce" states, nor between the "30 40" and "adv 2" states, nor between the "40 30" and "adv 1" states. From either state in each of these pairs, there is no sequence of player 1 / player 2 scoring that will result in a different final outcome (this is in principle how the minimization process identifies indistinguishable states).

Therefore, there's no point in keeping those states, and you can safely use this smaller 17-state FSA (15% smaller!) to score your tennis games instead:

For example, from "15 30", if player 1 scores, you go straight to "deuce" and don't bother with the redundant "30 30" state.

Another (simpler?) way to understand why these states are wasted is to recognize that the finite state machine is tracking two different pieces of information: first, how many points ahead player 1 is (since a player must win by 2 points) and second, how many points have been scored (since a player must score at least 4 points to win).

Once enough points (4 or more) have been scored by either player, their absolute scores no longer matter. All that matters is the relative score: whether player 1 is ahead by 1, equal, or behind by 1. For example, we don't care if the score is 197 to 196 or 6 to 5: they are the same thing.

Yet, early on, the FSA must also track the absolute scores, to ensure at least 4 points were scored by the winner. With the original 20-state FSA, the crossover between these two phases was what would have been "40 40" (each player scored 3 points). But in the minimal machine, the crossover became "30 30" (each player scored 2 points), which is safe since each player must still "win by 2" so if player 1 scores 2 points from "30 30", that means player 1 scored 4 points overall.

FSA minimization saved only 3 states for the game of tennis, resulting in a 15% smaller automaton, and maybe this simplifies keeping track of scores in your games by a bit, but in other FSA applications in Lucene, such as the analyzing suggester, MemoryPostingsFormat and the terms index, minimization is vital since it saves substantial disk and RAM for Lucene applications!

Sunday, August 3, 2014

A new proximity query for Lucene, using automatons

The simplest Apache Lucene query, TermQuery, matches any document that contains the specified term, regardless of where the term occurs inside each document. Using BooleanQuery you can combine multiple TermQuerys, with full control over which terms are optional (SHOULD) and which are required (MUST) or required not to be present (MUST_NOT), but still the matching ignores the relative positions of each term inside the document.

Sometimes you do care about the positions of the terms, and for such cases Lucene has various so-called proximity queries.

The simplest proximity query is PhraseQuery, to match a specific sequence of tokens such as "Barack Obama". Seen as a graph, a PhraseQuery is a simple linear chain:

By default the phrase must precisely match, but if you set a non-zero slop factor, a document can still match even when the tokens are not exactly in sequence, as long as the edit distance is within the specified slop. For example, "Barack Obama" with a slop factor of 1 will also match a document containing "Barack Hussein Obama" or "Barack H. Obama". It looks like this graph:

Now there are multiple paths through the graph, including an any (*) transition to match an arbitrary token. (Note: while the graph cannot properly express it, this query would also match a document that had the tokens Barack and Obama on top of one another, at the same position, which is a little bit strange!)

In general, proximity queries are more costly on both CPU and IO resources, since they must load, decode and visit another dimension (positions) for each potential document hit. That said, for exact (no slop) matches, using common-grams, shingles and ngrams to index additional "proximity terms" in the index can provide enormous performance improvements in some cases, at the expense of an increase in index size.

MultiPhraseQuery is another proximity query. It generalizes PhraseQuery by allowing more than one token at each position, for example:

This matches any document containing either domain name system or domain name service. MultiPhraseQuery also accepts a slop factor to allow for non-precise matches.

Finally, span queries (e.g. SpanNearQuery, SpanFirstQuery) go even further, allowing you to build up a complex compound query based on positions where each clause matched. What makes them unique is that you can arbitrarily nest them. For example, you could first build a SpanNearQuery matching Barack Obama with slop=1, then another one matching George Bush, and then make another SpanNearQuery, containing both of those as sub-clauses, matching if they appear within 10 terms of one another.

Introducing TermAutomatonQuery

As of Lucene 4.10 there will be a new proximity query to further generalize on MultiPhraseQuery and the span queries: it allows you to directly build an arbitrary automaton expressing how the terms must occur in sequence, including any transitions to handle slop. Here's an example:

This is a very expert query, allowing you fine control over exactly what sequence of tokens constitutes a match. You build the automaton state-by-state and transition-by-transition, including explicitly adding any transitions (sorry, no QueryParser support yet, patches welcome!). Once that's done, the query determinizes the automaton and then uses the same infrastructure (e.g. CompiledAutomaton) that queries like FuzzyQuery use for fast term matching, but applied to term positions instead of term bytes. The query is naively scored like a phrase query, which may not be ideal in some cases.

In addition to this new query there is also a simple utility class, TokenStreamToTermAutomatonQuery, that provides loss-less translation of any graph TokenStream into the equivalent TermAutomatonQuery. This is powerful because it means even arbitrary token stream graphs will be correctly represented at search time, preserving the PositionLengthAttribute that some tokenizers now set.

While this means you can finally correctly apply arbitrary token stream graph synonyms at query-time, because the index still does not store PositionLengthAttribute, index-time synonyms are still not fully correct. That said, it would be simple to build a TokenFilter that writes the position length into a payload, and then to extend the new TermAutomatonQuery to read from the payload and apply that length during matching (patches welcome!).

The query is likely quite slow, because it assumes every term is optional; in many cases it would be easy to determine required terms (e.g. Obama in the above example) and optimize such cases. In the case where the query was derived from a token stream, so that it has no cycles and does not use any transitions, it may be faster to enumerate all phrases accepted by the automaton (Lucene already has the getFiniteStrings API to do this for any automaton) and construct a boolean query from those phrase queries. This would match the same set of documents, also correctly preserving PositionLengthAttribute, but would assign different scores.

The code is very new and there are surely some exciting bugs! But it should be a nice start for any application that needs precise control over where terms occur inside documents.

Monday, May 12, 2014

Choosing a fast unique identifier (UUID) for Lucene

Most search applications using Apache Lucene assign a unique id, or primary key, to each indexed document. While Lucene itself does not require this (it could care less!), the application usually needs it to later replace, delete or retrieve that one document by its external id. Most servers built on top of Lucene, such as Elasticsearch and Solr, require a unique id and can auto-generate one if you do not provide it.

Sometimes your id values are already pre-defined, for example if an external database or content management system assigned one, or if you must use a URI, but if you are free to assign your own ids then what works best for Lucene?

One obvious choice is Java's UUID class, which generates version 4 universally unique identifiers, but it turns out this is the worst choice for performance: it is 4X slower than the fastest. To understand why requires some understanding of how Lucene finds terms.

BlockTree terms dictionary

The purpose of the terms dictionary is to store all unique terms seen during indexing, and map each term to its metadata (docFreq, totalTermFreq, etc.), as well as the postings (documents, offsets, postings and payloads). When a term is requested, the terms dictionary must locate it in the on-disk index and return its metadata.

The default codec uses the BlockTree terms dictionary, which stores all terms for each field in sorted binary order, and assigns the terms into blocks sharing a common prefix. Each block contains between 25 and 48 terms by default. It uses an in-memory prefix-trie index structure (an FST) to quickly map each prefix to the corresponding on-disk block, and on lookup it first checks the index based on the requested term's prefix, and then seeks to the appropriate on-disk block and scans to find the term.

In certain cases, when the terms in a segment have a predictable pattern, the terms index can know that the requested term cannot exist on-disk. This fast-match test can be a sizable performance gain especially when the index is cold (the pages are not cached by the the OS's IO cache) since it avoids a costly disk-seek. As Lucene is segment-based, a single id lookup must visit each segment until it finds a match, so quickly ruling out one or more segments can be a big win. It is also vital to keep your segment counts as low as possible!

Given this, fully random ids (like UUID V4) should perform worst, because they defeat the terms index fast-match test and require a disk seek for every segment. Ids with a predictable per-segment pattern, such as sequentially assigned values, or a timestamp, should perform best as they will maximize the gains from the terms index fast-match test.

Testing Performance

I created a simple performance tester to verify this; the full source code is here. The test first indexes 100 million ids into an index with 7/7/8 segment structure (7 big segments, 7 medium segments, 8 small segments), and then searches for a random subset of 2 million of the IDs, recording the best time of 5 runs. I used Java 1.7.0_55, on Ubuntu 14.04, with a 3.5 GHz Ivy Bridge Core i7 3770K.

Since Lucene's terms are now fully binary as of 4.0, the most compact way to store any value is in binary form where all 256 values of every byte are used. A 128-bit id value then requires 16 bytes.

I tested the following identifier sources:

Sequential IDs (0, 1, 2, ...), binary encoded.
Zero-padded sequential IDs (00000000, 00000001, ...), binary encoded.
Nanotime, binary encoded. But remember that nanotime is tricky.
UUID V1, derived from a timestamp, nodeID and sequence counter, using this implementation.
UUID V4, randomly generated using Java's UUID.randomUUID().
Flake IDs, using this implementation.

For the UUIDs and Flake IDs I also tested binary encoding in addition to their standard (base 16 or 36) encoding. Note that I only tested lookup speed using one thread, but the results should scale linearly (on sufficiently concurrent hardware) as you add threads.

Zero-padded sequential ids, encoded in binary are fastest, quite a bit faster than non-zero-padded sequential ids. UUID V4 (using Java's UUID.randomUUID()) is ~4X slower.

But for most applications, sequential ids are not practical. The 2nd fastest is UUID V1, encoded in binary. I was surprised this is so much faster than Flake IDs since Flake IDs use the same raw sources of information (time, node id, sequence) but shuffle the bits differently to preserve total ordering. I suspect the problem is the number of common leading digits that must be traversed in a Flake ID before you get to digits that differ across documents, since the high order bits of the 64-bit timestamp come first, whereas UUID V1 places the low order bits of the 64-bit timestamp first. Perhaps the terms index should optimize the case when all terms in one field share a common prefix.

I also separately tested varying the base from 10, 16, 36, 64, 256 and in general for the non-random ids, higher bases are faster. I was pleasantly surprised by this because I expected a base matching the BlockTree block size (25 to 48) would be best.

There are some important caveats to this test (patches welcome)! A real application would obviously be doing much more work than simply looking up ids, and the results may be different as hotspot must compile much more active code. The index is fully hot in my test (plenty of RAM to hold the entire index); for a cold index I would expect the results to be even more stark since avoiding a disk-seek becomes so much more important. In a real application, the ids using timestamps would be more spread apart in time; I could "simulate" this myself by faking the timestamps over a wider range. Perhaps this would close the gap between UUID V1 and Flake IDs? I used only one thread during indexing, but a real application with multiple indexing threads would spread out the ids across multiple segments at once.

I used Lucene's default TieredMergePolicy, but it is possible a smarter merge policy that favored merging segments whose ids were more "similar" might give better results. The test does not do any deletes/updates, which would require more work during lookup since a given id may be in more than one segment if it had been updated (just deleted in all but one of them).

Finally, I used using Lucene's default Codec, but we have nice postings formats optimized for primary-key lookups when you are willing to trade RAM for faster lookups, such as this Google summer-of-code project from last year and MemoryPostingsFormat. Likely these would provide sizable performance gains!

Friday, April 11, 2014

Testing Lucene's index durability after crash or power loss

One of Lucene's useful transactional features is index durability which ensures that, once you successfully call IndexWriter.commit, even if the OS or JVM crashes or power is lost, or you kill -KILL your JVM process, after rebooting, the index will be intact (not corrupt) and will reflect the last successful commit before the crash.

Of course, this only works if your hardware is healthy and your IO devices implement fsync properly (flush their write caches when asked by the OS). If you have data-loss issues, such as a silent bit-flipper in your memory, IO or CPU paths, thanks to the new end-to-end checksum feature (LUCENE-2446), available as of Lucene 4.8.0, Lucene will now detect that as well during indexing or CheckIndex. This is similar to the ZFS file system's block-level checksums, but not everyone uses ZFS yet (heh), and so Lucene now does its own checksum verification on top of the file system.

Be sure to enable checksum verification during merge by calling IndexWriterConfig.setCheckIntegrityAtMerge. In the future we'd like to remove that option and always validate checksums on merge, and we've already done so for the default stored fields format in LUCENE-5580 and (soon) term vectors format in LUCENE-5602, as well as set up the low-level IO APIs so other codec components can do so as well, with LUCENE-5583, for Lucene 4.8.0.

FileDescriptor.sync and fsync

Under the hood, when you call IndexWriter.commit, Lucene gathers up all newly written filenames since the last commit, and invokes FileDescriptor.sync on each one to ensure all changes are moved to stable storage.

At its heart, fsync is a complex operation, as the OS must flush any dirty pages associated with the specified file from its IO buffer cache, work with the underlying IO device(s) to ensure their write caches are also flushed, and also work with the file system to ensure its integrity is preserved. You can separately fsync the bytes or metadata for a file, and also the directory(ies) containing the file. This blog post is a good description of the challenges.

Recently we've been scrutinizing these parts of Lucene, and all this attention has uncovered some exciting issues!

In LUCENE-5570, to be fixed in Lucene 4.7.2, we discovered that the fsync implementation in our FSDirectory implementations is able to bring new 0-byte files into existence. This normally isn't a problem by itself, because IndexWriter shouldn't fsync a file that it didn't create. However, it exacerbates debugging when there is a bug in IndexWriter or in the application using Lucene (e.g., directly deleting index files that it shouldn't). In these cases it's confusing to discover these 0-byte files so much later, versus hitting a FileNotFoundException at the point when IndexWriter tried to fsync them.

In LUCENE-5588, to be fixed in Lucene 4.8.0, we realized we must also fsync the directory holding the index, otherwise it's possible on an OS crash or power loss that the directory won't link to the newly created files or that you won't be able to find your file by its name. This is clearly important because Lucene lists the directory to locate all the commit points (segments_N files), and of course also opens files by their names.

Since Lucene does not rely on file metadata like access time and modify time, it is tempting to use fdatasync (or FileChannel.force(false) from java) to fsync just the file's bytes. However, this is an optimization and at this point we're focusing on bugs. Furthermore, it's likely this won't be any faster since the metadata must still be sync'd by fdatasync if the file length has changed, which is always the case in Lucene since we only append to files when writing (we removed Indexoutput.seek in LUCENE-4399).

In LUCENE-5574, to be fixed as of Lucene 4.7.2, we found that a near-real-time reader, on closing, could delete files even if the writer it was opened from has been closed. This is normally not a problem by itself, because Lucene is write-once (never writes to the same file name more than once), as long as you use Lucene's APIs and don't modify the index files yourself. However, if you implement your own index replication by copying files into the index, and if you don't first close your near-real-time readers, then it is possible closing them would remove the files you had just copied.

During any given indexing session, Lucene writes many files and closes them, many files are deleted after being merged, etc., and only later, when the application finally calls IndexWriter.commit, will IndexWriter then re-open the newly created files in order to obtain a FileDescriptor so we can fsync them.

This approach (closing the original file, and then opening it again later in order to sync), versus never closing the original file and syncing that same file handle you used for writing, is perhaps risky: the javadocs for FileDescriptor.sync are somewhat vague as to whether this approach is safe. However, when we check the documentation for fsync on Unix/Posix and FlushFileBuffers on Windows, they make it clear that this practice is fine, in that the open file descriptor is really only necessary to identify which file's buffers need to be sync'd. It's also hard to imagine an OS that would separately track which open file descriptors had made which changes to the file. Nevertheless, out of paranoia or an abundance of caution, we are also exploring a possible patch on LUCENE-3237 to fsync only the originally opened files.

Testing that fsync really works

With all these complex layers in between your application's call to IndexWriter.commit and the laws of physics ensuring little magnets were flipped or a few electrons were moved into a tiny floating gate in a NAND cell, how can we reliably test that the whole series of abstractions is actually working?

In Lucene's randomized testing framework we have a nice evil Directory implementation called MockDirectoryWrapper. It can do all sorts of nasty things like throw random exceptions, sometimes slow down opening, closing and writing of some files, refuse to delete still-open files (like Windows), refuse to close when there are still open files, etc. This has helped us find all sorts of fun bugs over time.

Another thing it does on close is to simulate an OS crash or power loss by randomly corrupting any un-sycn'd files and then confirming the index is not corrupt. This is useful for catching Lucene bugs where we are failing to call fsync when we should, but it won't catch bugs in our implementation of sync in our FSDirectory classes, such as the frustrating LUCENE-3418 (first appeared in Lucene 3.1 and finally fixed in Lucene 3.4).

So, to catch such bugs, I've created a basic test setup, making use of a simple Insteon on/off device, along with custom Python bindings I created long ago to interact with Insteon devices. I already use these devices all over my home for controlling lights and appliances, so also using this for Lucene is a nice intersection of two of my passions!

The script loops forever, first updating the sources, compiling, checking the index for corruption, then kicking off an indexing run with some randomization in the settings, and finally, waiting a few minutes and then cutting power to the box. Then, it restores power, waits for the machine to be responsive again, and starts again.

So far it's done 80 power cycles and no corruption yet. Good news!

To "test the tester", I tried temporarily changing fsync to do nothing, and indeed after a couple iterations, the index became corrupt. So indeed the test setup seems to "work".

Currently the test uses Linux on a spinning magnets hard drive with the ext4 file system. This is just a start, but it's better than no proper testing for Lucene's fsync. Over time I hope to test different combinations of OS's, file systems, IO hardware, etc.

Wednesday, March 5, 2014

Using Lucene's search server to search Jira issues

You may remember my first blog post describing how the Lucene developers eat our own dog food by using a Lucene search application to find our Jira issues.

That application has become a powerful showcase of a number of modern Lucene features such as drill sideways and dynamic range faceting, a new suggester based on infix matches, postings highlighter, block-join queries so you can jump to a specific issue comment that matched your search, near-real-time indexing and searching, etc. Whenever new users ask me about Lucene's capabilities, I point them to this application so they can see for themselves.

Recently, I've made some further progress so I want to give an update.

The source code for the simple Netty-based Lucene server is now available on this subversion branch (see LUCENE-5376 for details). I've been gradually adding coverage for additional Lucene modules, including facets, suggesters, analysis, queryparsers, highlighting, grouping, joins and expressions. And of course normal indexing and searching! Much remains to be done (there are plenty of nocommits), and the goal here is not to build a feature rich search server but rather to demonstrate how to use Lucene's current modules in a server context with minimal "thin server" additional source code.

Separately, to test this new Lucene based server, and to complete the "dog food," I built a simple Jira search application plugin, to help us find Jira issues, here. This application has various Python tools to extract and index Jira issues using Jira's REST API and a user-interface layer running as a Python WSGI app, to send requests to the server and render responses back to the user. The goal of this Jira search application is to make it simple to point it at any Jira instance / project and enable full searching over all issues.

I just pushed some further changes to the production site:

I upgraded the Jira search application to the current server branch (previously it was running on my private fork).
I switched all analysis components to Lucene's analysis factories; these factories use Java's SPI (Service Provider Interface) so that the server has access to any char filters, tokenizers and token filters in the classpath. This is very helpful when building a server because it means you don't need any special code to handle the great many analysis components that Lucene provides these days. Everything simply passes through the factories (which know how to parse their own arguments).
I've added the Tika project, so you can now find Tika issues as well. This was very simple to add, and seems be working!
I inserted WordDelimiterFilter so that CamelCaseTokens are split. For example, try searching on infix and note the highlights. As Rober Muir reminded me, WordDelimiterFilter corrupts offsets, which will mess up highlighting in some cases, so I'm going to try to set up ICUTokenizer, which I'm already using, to do this splitting instead.
I switched to Lucene's new expressions module to do blended relevance + recency sort by default when you do a text search, which is helpful because most of the time we are looking for recently touched issues. Previously I used a custom FieldComparator to achieve the same functionality, but expressions is more compact and powerful and lets me remove that custom FieldComparator.
I switched to near-real-time building of the suggestions, using AnalyzingInfixSuggester. Previously I was fully rebuilding the suggester every five minutes, so this saves a lot of CPU since now I just add new Jira issues as they come, and refresh the suggester. It also means a much shorter delay from when an index is added to when it can be suggested. See LUCENE-5477 for details.
I now commit once per day. Previously I never committed, and simply relied on near-real-time searching. This works just fine, except when I need to bring the server down (e.g. to push new changes out), it required full reindexing, which was very fast but a poor user experience for those users who happened to do a search while it was happening. Now, when I bounce the server it comes back to the last commit and then the near-real-time indexing quickly catches up on any changed issues since that last commit.
Various small issues, such as proper handling when a Jira issue is renamed (the Jira REST API does not make it so easy to discover this!); better production push automation; upgraded to a newer version of bootstrap UI library.

There are still plenty of improvements to make to this Jira search application. For fields with many possible drill-down values, I'd like to have a simple suggester so the user can quickly drill down. I'd like to fix the suggester to filter suggestions according to the project. For example, if you've drilled down into Tika issues, then when you type a new search you should see only Tika issues suggested. For that we need to make AnalzyingInfixSuggester context aware. I'd also like a more compact UI for all of the facet fields; maybe I need to hide the less commonly used facet fields under a "More"...

Please send me any feedback / problems when you're searching for issues!

Thursday, January 23, 2014

Finding long tail suggestions using Lucene's new FreeTextSuggester

Lucene's suggest module offers a number of fun auto-suggest implementations to give a user live search suggestions as they type each character into a search box.

For example, WFSTCompletionLookup compiles all suggestions and their weights into a compact Finite State Transducer, enabling fast prefix lookup for basic suggestions.

AnalyzingSuggester improves on this by using an Analyzer to normalize both the suggestions and the user's query so that trivial differences in whitespace, casing, stop-words, synonyms, as determined by the analyzer, do not prevent a suggestion from matching.

Finally, AnalyzingInfixSuggester goes further by allowing infix matches so that words inside each suggestion (not just the prefix) can trigger a match. You can see this one action at the Lucene/Solr Jira search application (e.g., try "python") that I recently created to eat our own dog food. It is also the only suggester implementation so far that supports highlighting (this has proven challenging for the other suggesters).

Yet, a common limitation to all of these suggesters is that they can only suggest from a finite set of previously built suggestions. This may not be a problem if your suggestions are past user queries and you have tons and tons of them (e.g., you are Google). Alternatively, if your universe of suggestions is inherently closed, such as the movie and show titles that Netflix's search will suggest, or all product names on an e-commerce site, then a closed set of suggestions is appropriate.

N-Gram language models

For everyone else, where a high percentage of the incoming queries fall into the never-seen-before long tail, Lucene's newest suggester, FreeTextSuggester, can help! It uses the approach described in this Google blog post.

Rather than precisely matching a previous suggestion, it builds up a simple statistical n-gram language model from all suggestions and looks at the last tokens (plus the prefix of whatever final token the user is typing, if present) to predict the most likely next token.

For example, perhaps the user's query so far is: "flashforge 3d p", and because flashforge is an uncommon brand of 3D printer, this particular suggestion prefix was never added to the suggester. Yet, "3d printer" was a frequently seen phrase in other contexts (different brands). In this case, FreeTextSuggester will see "3d" and the "p" prefix for the next token and predict printer, even though "flashforge 3d printer" was never explicitly added as a suggestion.

You specify the order (N) of the model when you create the suggester: larger values of N require more data to train properly but can make more accurate predictions. All lower order models are also built, so if you specify N=3, you will get trigrams, bigrams and unigrams, all compiled into a single weighted FST for maximum sharing of the text tokens. Of course, larger N will create much larger FSTs. In practice N=3 is the highest you should go, unless you have tons of both suggestions to train and RAM to hold the resulting FST.

To handle sparse data, where a given context (the N-1 prior words) was not seen frequently enough to make accurate predictions, the suggester uses the stupid backoff language model (yes, this is really its name, and yes, it performs well!).

I expect the best way to use this new FreeTextSuggester will be as a fallback: you would first use one of the existing exact match suggesters, but when those suggesters fail to find any suggestions for a given query, because it's "unusual" and has crossed over into the long tail, you then fall back to FreeTextSuggester.

Google seems to use such a modal approach to suggestions as well: if you type "flashforge 3d p" you should see something like this, where each suggestion covers your entire query so far (indeed, Google has heard of the flashforge brand of 3d printer!):

But then if you keep typing and enter "flashforge 3d printer power u", the suggestions change: instead of suggesting an entire query, matching everything I have typed, Google instead suggests the last word or two:

As usual, this feature is very new and likely to contain exciting bugs! See the Jira issue, LUCENE-5214, for details. If you play with this new suggester please start a discussion on the Lucene's user list!

Wednesday, January 8, 2014

Geospatial (distance) faceting using Lucene's dynamic range facets

There have been several recent, quiet improvements to Lucene that, taken together, have made it surprisingly simple to add geospatial distance faceting to any Lucene search application, for example:

  < 1 km (147)
  < 2 km (579)
  < 5 km (2775)

Such distance facets, which allow the user to quickly filter their search results to those that are close to their location, has become especially important lately since most searches are now from mobile smartphones.

In the past, this has been challenging to implement because it's so dynamic and so costly: the facet counts depend on each user's location, and so cannot be cached and shared across users, and the underlying math for spatial distance is complex.

But several recent Lucene improvements now make this surprisingly simple!

First, Lucene's dynamic range faceting has been generalized to accept any ValueSource, not just a numeric doc values field from the index. Thanks to the recently added expressions module, this means you can offer dynamic range facets computed from an arbitrary JavaScript expression, since the expression is compiled on-the-fly to a ValueSource using custom generated Java bytecodes with ASM. Lucene's range faceting is also faster now, using segment trees to quickly assign each value to the matching ranges.

Second, the Haversine distance function was added to the expressions module. The implementation uses impressively fast approximations to the normally costly trigonometric functions, poached in part from the Java Optimized Development Kit project, without sacrificing too much accuracy. It's unlikely the approximations will ever matter in practice, and there is an open issue to further improve the approximation.

Suddenly, armed with these improvements, if you index latitude and longitude as DoubleDocValuesFields in each document, and you know the user's latitude/longitude location for each request, you can easily compute facet counts and offer drill-downs by any set of chosen distances.

First, index your documents with latitude/longitude fields:

Document doc = new Document();
doc.add(new DoubleField("latitude", 40.759011, Field.Store.NO));
doc.add(new DoubleField("longitude", -73.9844722, Field.Store.NO));
writer.addDocument(doc);

At search time, obtain the ValueSource by building a dynamic expression that invokes the Haversine function:

private ValueSource getDistanceValueSource() {
  Expression distance;
  try {
    distance = JavascriptCompiler.compile(
                 "haversin(40.7143528,-74.0059731,latitude,longitude)");
  } catch (ParseException pe) {
    // Should not happen
    throw new RuntimeException(pe);
  }
  SimpleBindings bindings = new SimpleBindings();
  bindings.add(new SortField("latitude", SortField.Type.DOUBLE));
  bindings.add(new SortField("longitude", SortField.Type.DOUBLE));

  return distance.getValueSource(bindings);
}

Instead of the hardwired latitude/longitude above, you should fill in the user's location.

Using that ValueSource, compute the dynamic facet counts like this:

FacetsCollector fc = new FacetsCollector();

searcher.search(new MatchAllDocsQuery(), fc);

Facets facets = new DoubleRangeFacetCounts(
                    "field",
                    getDistanceValueSource(), fc,
                    ONE_KM,
                    TWO_KM,
                    FIVE_KM,
                    TEN_KM);

return facets.getTopChildren(10, "field");

Normally you'd use a "real" query instead of the top-level-browse MatchAllDocsQuery. Finally, once the user picks a distance for drill-down, use the Range.getFilter method and add that to a DrillDownQuery using ConstantScoreQuery:

public TopDocs drillDown(DoubleRange range) throws IOException {
  // Passing no baseQuery means we drill down on all
  // documents ("browse only"):
  DrillDownQuery q = new DrillDownQuery(null);

  q.add("field", new ConstantScoreQuery(
                     range.getFilter(getDistanceValueSource())));

  return searcher.search(q, 10);
}

See the full source code here, from the lucene/demo module.

When I first tested this example, there was a fun bug, and then later the facet APIs were overhauled, so you'll need to wait for the Lucene 4.7 release, or just use the current the 4.x sources, to get this example working.

While this example is simple, and works correctly, there are some clear performance improvements that are possible, such as using a bounding box as a fast match to avoid computing Haversine for hits that are clearly outside of the range of possible drill-downs (patches welcome!). Even so, this is a nice step forward for Lucene's faceting and it's amazing that geospatial distance faceting with Lucene can be so simple.