Changing Bits: June 2013

Saturday, June 22, 2013

2X faster PhraseQuery with Lucene using C++ via JNI

I recently described the new lucene-c-boost github project, which provides amazing speedups (up to 7.8X faster) for common Lucene query types using specialized C++ implementations via JNI.

The code works with a stock Lucene 4.3.0 JAR and default codec, and has a trivial API: just call NativeSearch.search instead of IndexSearcher.search.

Now, a quick update: I've optimized PhraseQuery now as well:

Task	QPS base	StdDev base	QPS opt	StdDev opt	% change
HighPhrase	3.5	(2.7%)	6.5	(0.4%)	1.9 X
MedPhrase	27.1	(1.4%)	51.9	(0.3%)	1.9 X
LowPhrase	7.6	(1.7%)	16.4	(0.3%)	2.2 X

~2X speedup (~90% - ~119%) is nice!

Again, it's great to see a reduced variance on the runtimes since hotspot is mostly not an issue. It's odd that LowPhrase gets slower QPS than MedPhrase: these queries look mis-labelled (I see the LowPhrase queries getting more hits than MedPhrase!).

All changes have been pushed to lucene-c-boost; next I'd like to figure out how to get facets working.

A new Lucene suggester based on infix matches

Suggest, sometimes called auto-suggest, type-ahead search or auto-complete, is now an essential search feature ever since Google added it almost 5 years ago.

Lucene has a number of implementations; I previously described AnalyzingSuggester. Since then, FuzzySuggester was also added, which extends AnalyzingSuggester by also accepting mis-spelled inputs.

Here I describe our newest suggester: AnalyzingInfixSuggester, now going through iterations on the LUCENE-4845 Jira issue.

Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.

You can see it in action at the example Jira search application that I built to showcase various Lucene features.

For example, if you enter japan you should see various issues suggested, including:

SOLR-4945: Japanese Autocomplete and Highlighter broken
LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji
LUCENE-3921: Add decompose compound Japanese Katakana token capability to Kuromoji

As you can see, the incoming characters can match not just the prefix of each suggestion but also the prefix of any token within.

Unlike the existing suggesters, this new suggester does not use a specialized data-structure such as FSTs. Instead, it's an "ordinary" Lucene index under-the-hood, making use of EdgeNGramTokenFilter to index the short prefixes of each token, up to length 3 by default, for fast prefix querying.

It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time, and at lookup time uses a custom Collector to stop after finding the first N matching hits since these hits are the best matches when sorting by weight. The lookup method lets you specify whether all terms must be found, or any of the terms (Jira search requires all terms).

Since the suggestions are sorted solely by weight, and no other relevance criteria, this suggester is a good fit for applications that have a strong a-priori weighting for each suggestion, such as a movie search engine ranking suggestions by popularity, recency or a blend, for each movie. In Jira search I rank each suggestion (Jira issue) by how recently it was updated.

Specifically, there is no penalty for suggestions with matching tokens far from the beginning, which could mean the relevance is poor in some cases; an alternative approach (patch is on the issue) uses FSTs instead, which can require that the matched tokens are within the first three tokens, for example. This would also be possible with AnalyzingInfixSuggester using an index-time analyzer that dropped all but the first three tokens.

One nice benefit of an index-based approach is AnalyzingInfixSuggester handles highlighting of the matched tokens (red color, above), which has unfortunately proven difficult to provide with the FST-based suggesters. Another benefit is, in theory, the suggester could support near-real-time indexing, but I haven't exposed that in the current patch and probably won't for some time (patches welcome!).

Performance is reasonable: somewhere between AnalyzingSuggester and FuzzySuggester, between 58 - 100 kQPS (details on the issue).

Analysis fun

As with AnalyzingSuggester, AnalyzingInfixSuggester let's you separately configure the index-time vs. search-time analyzers. With Jira search, I enabled stop-word removal at index time, but not at search time, so that a query like or would still successfully find any suggestions containing words starting with or, rather than dropping the term entirely.

Which suggester should you use for your application? Impossible to say! You'll have to test each of Lucene's offerings and pick one. Auto-suggest is an area where one-size-does-not-fit-all, so it's great that Lucene is picking up a number of competing implementations. Whichever you use, please give us feedback so we can further iterate and improve!

Wednesday, June 19, 2013

Screaming fast Lucene searches using C++ via JNI

At the end of the day, when Lucene executes a query, after the initial setup the true hot-spot is usually rather basic code that decodes sequential blocks of integer docIDs, term frequencies and positions, matches them (e.g. taking union or intersection for BooleanQuery), computes a score for each hit and finally saves the hit if it's competitive, during collection.

Even apparently complex queries like FuzzyQuery or WildcardQuery go through a rewrite process that reduces them to much simpler forms like BooleanQuery.

Lucene's hot-spots are so simple that optimizing them by porting them to native C++ (via JNI) was too tempting!

So I did just that, creating the lucene-c-boost github project, and the resulting speedups are exciting:

Task	QPS base	StdDev base	QPS opt	StdDev opt	% change
AndHighLow	469.2	(0.9%)	316.0	(0.7%)	0.7 X
Fuzzy1	63.0	(3.3%)	62.9	(2.0%)	1.0 X
Fuzzy2	25.8	(3.1%)	37.9	(2.3%)	1.5 X
AndHighMed	50.4	(0.7%)	110.0	(0.9%)	2.2 X
OrHighNotLow	46.8	(5.6%)	106.3	(1.3%)	2.3 X
LowTerm	298.6	(1.8%)	691.4	(3.4%)	2.3 X
OrHighNotMed	34.0	(5.3%)	89.2	(1.3%)	2.6 X
OrHighNotHigh	5.0	(5.7%)	14.2	(0.8%)	2.8 X
Wildcard	17.2	(1.2%)	51.1	(9.5%)	3.0 X
AndHighHigh	21.9	(1.0%)	69.0	(1.0%)	3.5 X
OrHighMed	18.7	(5.7%)	59.6	(1.1%)	3.2 X
OrHighHigh	6.7	(5.7%)	21.5	(0.9%)	3.2 X
OrHighLow	15.7	(5.9%)	50.8	(1.2%)	3.2 X
MedTerm	69.8	(4.2%)	243.0	(2.2%)	3.5 X
OrNotHighHigh	13.3	(5.7%)	46.7	(1.4%)	3.5 X
OrNotHighMed	26.7	(5.8%)	115.8	(2.8%)	4.3 X
HighTerm	22.4	(4.2%)	109.2	(1.4%)	4.9 X
Prefix3	10.1	(1.1%)	55.5	(3.7%)	5.5 X
OrNotHighLow	62.9	(5.5%)	351.7	(9.3%)	5.6 X
IntNRQ	5.0	(1.4%)	38.7	(2.1%)	7.8 X

These results are on the full, multi-segment Wikipedia English index with 33.3 M documents. Besides the amazing speedups, it's also nice to see that the variance (StdDev column) is generally lower with the optimized C++ version, because hotspot has (mostly) been taken out of the equation.

The API is easy to use, and works with the default codec so you won't have to re-index just to try it out: instead of IndexSearcher.search, call NativeSearch.search. If the query can be optimized, it will be; otherwise it will seamlessly fallback to IndexSearcher.search. It's fully decoupled from Lucene and works with the stock Lucene 4.3.0 JAR, using Java's reflection APIs to grab the necessary bits.

This is all very new code, and I'm sure there are plenty of exciting bugs, but (after some fun debugging!) all Lucene core tests now pass when using NativeSearch.search.

This is not a C++ port of Lucene

This code is definitely not a general C++ port of Lucene. Rather, it implements a very narrow set of classes, specifically the common query types. The implementations are not general-purpose: they hardwire (specialize) specific code, removing all abstractions like Scorer, DocsEnum, Collector, DocValuesProducer, etc.

There are some major restrictions on when the optimizations will apply:

Only tested on Linux and Intel CPU so far
Requires Lucene 4.3.x
Must use NativeMMapDirectory as your Directory implementation, which maps entire files into RAM (avoids the chunking that the Java-based MMapDirectory must do)
Must use the default codec
Only sort by score is supported
None of the optimized implementations use advance: first, this code is rather complex and will be quite a bit of work to port to C++, and second, queries that benefit from advance are generally quite fast already so we may as well leave them in Java

BooleanQuery is optimized, but only when all clauses are TermQuery against the same field.

C++ is not faster than java!

Not necessarily, anyway: before anyone goes off screaming how these results "prove" Java is so much slower than C++, remember that this is far from a "pure" C++ vs Java test. There are at least these three separate changes mixed in:

Algorithmic changes. For example, lucene-c-boost sometimes uses BooleanScorer where Lucene is using BooleanScorer2. Really we need to fix Lucene to do similar algorithmic changes (when they are faster). In particular, all of the OrXX queries that include a Not clause, as well as IntNRQ in the above results, benefit from algorithmic changes.
Code specialization: lucene-c-boost implements the searches as big hardwired scary-looking functions, removing all the nice Lucene abstractions. While abstractions are obviously necessary in Lucene, they unfortunately add run-time cost overhead, so removing these abstractions gives some gains.
C++ vs java.

It's not at all clear how much of the gains are due to which part; really I need to create the "matching" specialized Java sources to do a more pure test.

This code is dangerous!

Specifically, whenever native C++ code is embedded in Java, there is always the risk of all those fun problems with C++ that we Java developers thought we left behind. For example, if there are bugs (likely!), or even innocent API mis-use by the application such as accidentally closing an IndexReader while other threads are still using it, the process will hit a Segmentation Fault and the OS will destroy the JVM. There may also be memory leaks! And, yes, the C++ sources even use the goto statement.

Work in progress...

This is a work in progress and there are still many ideas to explore. For example, Lucene 4.3.x's default PostingsFormat stores big-endian longs, which means the little-endian Intel CPU must do byte-swapping when decoding each postings block, so one thing to try is a PostingsFormat better optimized for the CPU at search time. Positional queries, Filters and nested BooleanQuery are not yet optimized, as well as certain configurations (e.g., fields that omit norms). Patches welcome!

Nevertheless, initial results are very promising, and if you are willing to risk the dangers in exchange for massive speedups please give it a whirl and report back.

Sunday, June 9, 2013

Build your own finite state transducer

Have you always wanted your very own Lucene finite state transducer (FST) but you couldn't figure out how to use Lucene's crazy APIs?

Then today is your lucky day! I just built a simple web application that creates an FST from the input/output strings that you enter.

If you just want a finite state automaton (no outputs) then enter only inputs, such as this example:

If all of your outputs are non-negative integers then the FST will use numeric outputs, where you sum up the outputs as you traverse a path to get the final output:

Finally, if the outputs are non-numeric then they are treated as strings, in which case you concatenate as you traverse the path:

The red arcs are the ones with the NEXT optimization: these arcs do not store a pointer to a node because their to-node is the very next node in the FST. This is a good optimization: it generally results in a large reduction of the FST size. The bolded arcs tell you the next node is final; this is most interesting when a prefix of another input is accepted, such as this example:

Here the "r" arc is bolded, telling you that "star" is accepted. Furthermore, that node following the "r" arc has a final output, telling you the overall output for "star" is "abc".

The web app is a simple Python WSGI app; source code is here. It invokes a simple Java tool as a subprocess; source code (including generics violations!) is here.