Thursday, December 12, 2013

Fast range faceting using segment trees and the Java ASM library

In Lucene's facet module we recently added support for dynamic range faceting, to show how many hits match each of a dynamic set of ranges. For example, the Updated drill-down in the Lucene/Solr issue search application uses range facets. Another example is distance facets (< 1 km, < 2 km, etc.), where the distance is dynamically computed based on the user's current location. Price faceting might also use range facets, if the ranges cannot be established during indexing.

To implement range faceting, for each hit, we first calculate the value (the distance, the age, the price) to be aggregated, and then lookup which ranges match that value and increment its counts. Today we use a simple linear search through all ranges, which has O(N) cost, where N is the number of ranges.

But this is inefficient!

Segment trees

There are fun data structures like segment trees and interval trees with O(log(N) + M) cost per lookup, where M is the number of ranges that match the given value. I chose to explore segment trees, as Lucene only requires looking up by a single value (interval trees can also efficiently look up all ranges overlapping a provided range) and also because all the ranges are known up front (interval trees also support dynamically adding or removing ranges).


If the ranges will never overlap, you can use a simple binary search; Guava's ImmutableRangeSet takes this approach. However, Lucene's range faceting allows overlapping ranges so we can't do that.

Segment trees are simple to visualize: you "project" all ranges down on top of one another, creating a one-dimensional Venn diagram, to define the elementary intervals. This classifies the entire range of numbers into a minimal number of distinct ranges, each elementary interval, such that all points in each elementary interval always match the same set of input ranges. The lookup process is then a binary search to determine which elementary interval a point belongs to, recording the matched ranges as you recurse down the tree.

Consider these ranges; the lower number is inclusive and the upper number is exclusive:
 0:  0 – 10
 1:  0 – 20
 2: 10 – 30
 3: 15 – 50
 4: 40 – 70
The elementary intervals (think Venn diagram!) are:
  -∞ – 0
   0 – 10
  10 – 15
  15 – 20
  20 – 30
  30 – 40
  40 – 50
  50 – 70
  70 – ∞
Finally, you build a binary tree on top of the elementary ranges, and then add output range indices to both internal nodes and the leaves of that tree, necessary to prevent adversarial cases that would require too much (O(N^2)) space. During lookup, as you walk down the tree, you gather up the output ranges (indices) you encounter; for our example, each elementary range is assigned the follow range indices as outputs:
  -∞ – 0 →
   0 – 10 → 0
  10 – 15 → 1, 2
  15 – 20 → 1, 2, 3
  20 – 30 → 2, 3
  30 – 40 → 3
  40 – 50 → 3, 4
  50 – 70 → 4
  70 – ∞  →
Some ranges correspond to 1 elementary interval, while other ranges correspond to 2 or 3 or more, in general. Some, 2 in this example, may have no matching input ranges.

Looking up matched ranges

I've pushed all sources described below to new Google code project; the code is still somewhat rough and exploratory, so there are likely exciting bugs lurking, but it does seem to work: it includes (passing!) tests and simple micro-benchmarks.

I started with a basic segment tree implementation as described on the Wikipedia page, for long values, called SimpleLongRangeMultiSet; here's the recursive lookup method:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
  private int lookup(Node node, long v, int[] answers, int upto) {
    if (node.outputs != null) {
      for(int range : node.outputs) {
        answers[upto++] = range;
      }
    }
    if (node.left != null) {
      if (v <= node.left.end) {
        upto = lookup(node.left, v, answers, upto);
      } else {
        upto = lookup(node.right, v, answers, upto);
      }
    }

    return upto;
  }


This worked correctly, but I realized there must be non-trivial overhead for the recursion, checking for nulls, the for loop over the output values, etc. Next, I tried switching to parallel arrays to hold the binary tree (ArrayLongRangeMultiSet), where the left child of node N is at 2*N and the right child is at 2*N+1, but this turned out to be slower.

After that I tested a code specializing implementation, first by creating dynamic Java source code from the binary tree. This eliminates the recursion and creates a single simple method that uses a series of if statements, specialized to the specific ranges, to do the binary search and record the range indices. Here's the resulting specialized code, compiled from the above ranges:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
  void lookup(long v, int[] answers) {
    int upto = 0;
    if (v <= 19) {
      if (v <= 9) {
        if (v >= 0) {
          answers[upto++] = 0;
          answers[upto++] = 1;
        }
      } else {
        answers[upto++] = 1;
        answers[upto++] = 2;
        if (v >= 15) {
          answers[upto++] = 3;
        }
      }
    } else {
      if (v <= 39) {
        answers[upto++] = 3;
        if (v <= 29) {
          answers[upto++] = 2;
        }
      } else {
        if (v <= 49) {
          answers[upto++] = 3;
          answers[upto++] = 4;
        } else {
          if (v <= 69) {
            answers[upto++] = 4;
          }
        }
      }
    }
  }


Finally, using the ASM library, I compiled the tree directly to specialized Java bytecode, and this proved to be fastest (up to 2.5X faster in some cases).

As a baseline, I also added the simple linear search method, LinearLongRangeMultiSet; as long as you don't have too many ranges (say 10 or less), its performance is often better than the Java segment tree.

The implementation also allows you to specify the allowed range of input values (for example, maybe all values are >=0 in your usage), which can save an if statement or two in the lookup method.

Counting all matched ranges

While the segment tree allows you to quickly look up all matching ranges for a given value, after a nice tip from fellow Lucene committee Robert Muir, we realized Lucene's range faceting does not need to know the ranges for each value; instead, it only requires the aggregated counts for each range in the end, after seeing many values.

This leads to an optimization: compute counts for each elementary interval and then in the end, roll up those counts to get the count for each range. This will only work for single-valued fields, since for a multi-valued field you'd need to carefully never increment the same range more than once per hit.

So based on that approach, I created a new LongRangeCounter abstract base class, and the SimpleLongRangeCounter Java implementation, and also the ASM specialized version, and the results are indeed faster (~20 to 50%) than using the lookup method to count; I'll use this approach with Lucene.

Segment trees are normally always "perfectly" balanced but one final twist I explored was to use a training set of values to bias the order of the if statements. For example, if your ranges cover a tiny portion of the search space, as is the case for the Updated drill-down, then it should be faster to use a slightly unbalanced tree, by first checking if the value is less than the maximum range. However, in testing, while there are some cases where this "training" is a bit faster, often it's slower; I'm not sure why.

Lucene

I haven't folded this into Lucene yet, but I plan to; all the exploratory code lives in the segment-trees Google code project for now.

Results on the micro-benchmarks can be entirely different once the implementations are folded into a "real" search application. While ASM is a powerful way to generate specialized code, and it gives sizable performance gains at least in the micro-benchmarks, it is an added dependency and complexity for ongoing development and many more developers know Java than ASM. It may also confuse hotspot, causing deoptimizations when there are multiple implementations for an abstract base class. Furthermore, if there are many ranges, the resulting specialized bytecode can be become quite large (but, still O(N*log(N)) in size), which may cause other problems. On balance I'm not sure the sizable performance gains (on a micro-benchmark) warrant using ASM in Lucene's range faceting.

Friday, November 29, 2013

Pulling H264 video from an IP camera using Python

IP cameras have come a long ways, and recently I upgraded some old cameras to these new Lorex cameras (model LNB2151/LNB2153) and I'm very impressed.

These cameras record 1080p wide-angle video at 30 frames per second, use power over ethernet (PoE), can see when it's dark using builtin infrared LEDs and are weather-proof. The video quality is impressive and they are surprisingly inexpensive. The camera can deliver two streams at once, so you can pull a lower resolution stream for preview, motion detection, etc., and simultaneously pull the higher resolution stream to simply record it for later scrutinizing.

After buying a few of these cameras I needed a simple way to pull the raw H264 video from them, and with some digging I discovered the cameras speak RTSP and RTP which are standard protocols for streaming video and audio from IP cameras. Many IP cameras have adopted these standards.

Both VLC and MPlayer can play RTSP/RTP video streams; for the Lorex cameras the default URL is:

  rtsp://admin:000000@<hostname>/PSIA/Streaming/channels/1.

After more digging I found the nice open-source (LGPL license) Live555 project, which is a C++ library for all sorts of media related protocols, including RTSP, RTP and RTCP. VLC and MPlayer use this library for their RTSP support. Perfect!

My C++ is a bit rusty, and I really don't understand all of Live555's numerous APIs, but I managed to cobble together a simple Python extension module, derived from Live555's testRTSPClient.cpp example, that seems to work well.

I've posted my current source code in a new Google code project named pylive555. It provides a very simple API (only 3 functions!) to pull frames from a remote camera via RTSP/RTP; Live555 has many, many other APIs that I haven't exposed.

The code is thread-friendly (releases the global interpreter lock when invoking the Live555 APIs).

I've included a simple example.py Python program, that shows how to load H264 video frames from the camera and save them to a local file. You could start from this example and modify it to do other things, for example use the ffmpeg H264 codec to decode individual frames, use a motion detection library to trigger recording, parse each frame's metadata to find the keyframes, etc. Here's the current example.py:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
import time
import sys
import live555
import threading

# Shows how to use live555 module to pull frames from an RTSP/RTP
# source.  Run this (likely first customizing the URL below:

# Example: python3 example.py 10.17.4.118 1 10 out.264 
if len(sys.argv) != 5:
  print()
  print('Usage: python3 example.py cameraIP channel seconds fileOut')
  print()
  sys.exit(1)
  
cameraIP = sys.argv[1]
channel = sys.argv[2]
seconds = float(sys.argv[3])
fileOut = sys.argv[4]

# NOTE: the username & password, and the URL path, will vary from one
# camera to another!  This URL path works with the Lorex LNB2153:
url = 'rtsp://admin:000000@%s/PSIA/Streaming/channels/%s' % (cameraIP, channel)

fOut = open(fileOut, 'wb')

def oneFrame(codecName, bytes, sec, usec, durUSec):
  print('frame for %s: %d bytes' % (codecName, len(bytes)))
  fOut.write(b'\0\0\0\1' + bytes)

# Starts pulling frames from the URL, with the provided callback:
useTCP = False
live555.startRTSP(url, oneFrame, useTCP)

# Run Live555's event loop in a background thread:
t = threading.Thread(target=live555.runEventLoop, args=())
t.setDaemon(True)
t.start()

endTime = time.time() + seconds
while time.time() < endTime:
  time.sleep(0.1)

# Tell Live555's event loop to stop:
live555.stopEventLoop()

# Wait for the background thread to finish:
t.join()


Installation is very easy; see the README.txt. I've only tested on Linux with Python3.2 and with the Lorex LNB2151 cameras.

I'm planning on installing one of these Lorex cameras inside a bat house that I'll build with the kids this winter. If we're lucky we'll be able to view live bats in the summer!

Tuesday, November 12, 2013

Playing a sound (AIFF) file from Python using PySDL2

Sometimes you need to play sounds or music (digitized samples) from Python, which really ought to be a simple task. Yet it took me a little while to work out, and the resulting source code is quite simple, so I figured I'd share it here in case anybody else is struggling with it.

The Python wiki lists quite a few packages for working with audio, but most of them are overkill for basic audio recording and playback.

For quite some time I had been using PyAudio, which adds Python bindings to the PortAudio project. I really like it because it focuses entirely on recording and playing audio. But, for some reason, when I recently upgraded to Mavericks, it stutters whenever I try to play samples at a sample rate lower than 44.1 KHz. I've emailed the author to try to get to the bottom of it.

In the meantime, I tried a new package, PySDL2, which adds Python bindings to the SDL2 (Simple Directmedia Layer) project.

SDL2 does quite a bit more than basic audio, and I didn't dig into any of that yet. I hit one small issue with PySDL2, but the one-line change in the issue fixes it. Here's the resulting code:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
import sdl2
import sys
import aifc
import threading

class ReadAIFF:
  def __init__(self, fileName):
    self.a = aifc.open(fileName)
    self.frameUpto = 0
    self.bytesPerFrame = self.a.getnchannels() * self.a.getsampwidth()
    self.numFrames = self.a.getnframes()
    self.done = threading.Event()
    
  def playNextChunk(self, unused, buf, bufSize):
    framesInBuffer = bufSize/self.bytesPerFrame
    framesToRead = min(framesInBuffer, self.numFrames-self.frameUpto)

    if self.frameUpto == self.numFrames:
      self.done.set()

    # TODO: is there a faster way to copy the string into the ctypes
    # pointer/array?
    for i, b in enumerate(self.a.readframes(framesToRead)):
      buf[i] = ord(b)

    # Play silence after:
    # TODO: is there a faster way to zero out the array?
    for i in range(self.bytesPerFrame*framesToRead, self.bytesPerFrame*framesInBuffer):
      buf[i] = 0

    self.frameUpto += framesToRead

if sdl2.SDL_Init(sdl2.SDL_INIT_AUDIO) != 0:
  raise RuntimeError('failed to init audio')

p = ReadAIFF(sys.argv[1])
spec = sdl2.SDL_AudioSpec(p.a.getframerate(),
                          sdl2.AUDIO_S16MSB,
                          p.a.getnchannels(),
                          512,
                          sdl2.SDL_AudioCallback(p.playNextChunk))

# TODO: instead of passing None for the 4th arg, I really should pass
# another AudioSpec and then confirm it matched what I asked for:
devID = sdl2.SDL_OpenAudioDevice(None, 0, spec, None, 0)
if devID == 0:
  raise RuntimeError('failed to open audio device')

# Tell audio device to start playing:
sdl2.SDL_PauseAudioDevice(devID, 0)

# Wait until all samples are done playing
p.done.wait()

sdl2.SDL_CloseAudioDevice(devID)


The code is straightforward: it loads an AIFF file, using Python's builtin aifc module, and then creates a callback, playNextChunk which is invoked by PySDL2 when it needs more samples to play. So far it seems to work very well!

Saturday, September 28, 2013

Lucene now has an in-memory terms dictionary, thanks to Google Summer of Code

Last year, Han Jiang's Google Summer of Code project was a big success: he created a new (now, default) postings format for substantially faster searches, along with smaller indices.

This summer, Han was at it again, with a new Google Summer of Code project with Lucene: he created a new terms dictionary holding all terms and their metadata in memory as an FST.

In fact, he created two new terms dictionary implementations. The first, FSTTermsWriter/Reader, hold all terms and metadata in a single in-memory FST, while the second, FSTOrdTermsWriter/Reader, does the same but also supports retrieving the ordinal for a term (TermsEnum.ord()) and looking up a term given its ordinal (TermsEnum.seekExact(long ord)). The second one also uses this ord internally so that the FST is more compact, while all metadata is stored outside of the FST, referenced by ord.

Like the default BlockTree terms dictionary, these new terms dictionaries accept any PostingsBaseFormat so you can separately plug in whichever format you want to encode/decode the postings.

Han also improved the PostingsBaseFormat API so that there is now a cleaner separation of how terms and their metadata are encoded vs. how postings are encoded; PostingsWriterBase.encodeTerm and PostingsReaderBase.decodeTerm now handle encoding and decoding any term metadata required by the postings format, abstracting away how the long[]/byte[] were persisted by the terms dictionary. Previously this line was annoyingly blurry.

Unfortunately, while the performance for primary key lookups is substantially faster, other queries e.g. WildcardQuery are slower; see LUCENE-3069 for details. Fortunately, using PerFieldPostingsFormat, you are free to pick and choose which fields (e.g. your "id" field) should use the new terms dictionary.

For now this feature is trunk-only (eventually Lucene 5.0).

Thank you Han and thank you Google!

Monday, September 16, 2013

Three exciting Lucene features in one day

Three exciting Lucene features in one day

Yesterday was a productive day: suddenly, there are three exciting new features coming to Lucene.

Expressions module

The first feature, committed yesterday, is the new expressions module. This allows you to define a dynamic field for sorting, using an arbitrary String expression. There is builtin support for parsing JavaScript, but the parser is pluggable if you want to create your own syntax.

For example, you could define a sort field using the expression
  sqrt(_score) + ln(popularity)
if you want to offer a blended sort primarily by relevance and boosting by a popularity field.

The code is very easy to use; there are some nice examples in the TestDemoExpressions.java unit test case, and this will be available in Lucene's next stable release (4.6).

Updateable numeric doc-values fields

The second feature, also committed yesterday, is updateable numeric doc-values fields, letting you change previously indexed numeric values using the new updateNumericDocValue method on IndexWriter. It works fine with near-real-time readers, so you can update the numeric values for a few documents and then re-open a new near-real-time reader to see the changes.

The feature is currently trunk only as we work out a few remaining issues involving an particularly controversial boolean. It also currently does not work on sparse fields, i.e. you can only update a document's value if that document had already indexed that field in the first place.

Combined, these two features enable powerful use-cases where you want to sort by a blended field that is changing over time. For example, perhaps you measure how often your users click through each document in the search results, and then use that to update the popularity field, which is then used for a blended sort. This way the rankings of the search results change over time as you learn from users which documents are popular and which are not.

Of course such a feature was always possible before, using custom external code, but with both expressions and updateable doc-values now available it becomes trivial to implement!

Free text suggestions

Finally, the third feature is a new suggester implementation, FreeTextSuggester. It is a very different suggester than the existing ones: rather than suggest from a finite universe of pre-built suggestions, it uses a simple ngram language model to predict the "long tail" of possible suggestions based on the 1 or 2 previous tokens.

Under the hood, it uses ShingleFilter to create the ngrams, and an FST to store and lookup the resulting ngram models. While multiple ngram models are stored compactly in a single FST, the FST can still get quite large; the 3-gram, 2-gram and 1-gram model built on the AOL query logs is 19.4 MB (the queries themselves are 25.4 MB). This was inspired by Google's approach.

Likely this suggester would not be used by itself, but rather as a fallback when your primary suggester failed to find any suggestions; you can see this behavior with Google. Try searching for "the fast and the ", and you will see the suggestions are still full queries. But if the next word you type is "burning" then suddenly google (so far!) does not have a full suggestion and falls back to their free text approach.

Wednesday, August 14, 2013

SuggestStopFilter carefully removes stop words for suggesters

Lucene now has a nice set of suggesters that use an analyzer to tokenize the suggestions: AnalyzingSuggester, FuzzySuggester and AnalyzingInfixSuggester. Using an analyzer is powerful because it lets you customize exactly how suggestions are matched: you can normalize case, apply stemming, match across different synonym forms, etc.

One of the most common things you'll do with your analyzer is to remove stop-words using StopFilter. Unfortunately, if you try this, you'll quickly notice that the stop filter is too aggressive because it happily removes the last token even if the user isn't done typing it yet. For example if the user has typed "a", you'd expect suggestions like apple, aardvark, etc., but you won't get that because StopFilter removed the "a" token.

You could try using StopFilter only while indexing, which was my first attempt with the suggestions at jirasearch.mikemccandless.com, but then, at least for AnalyzingInfixSuggester, you'll fail to get matches when you pass allTermsRequired=true because the suggester then requires that even stop words find matches.

Finally, you could use the new StopSuggestFilter at lookup time: this filter is just like StopFilter except when the token is the very last token, it checks the offset for that token and if the offset indicates that the token has ended without any further non-token characters, then the token is preserved. The token is also marked as a keyword, so that any later stem filters won't change it. This way a query "a" can find "apple", but a query "a " (with a trailing space) will find nothing because the "a" will be removed.

I've pushed StopSuggestFilter to jirasearch.mikemccandless.com and it seems to be working well so far!

Friday, August 2, 2013

A new version of the Compact Language Detector

It's been almost two years since I originally factored out the fast and accurate Compact Language Detector from the Chromium project, and the effort was clearly worthwhile: the project is popular and others have created additional bindings for languages including at least Perl, Ruby, R, JavaScript, PHP and C#/.NET.

Eric Fischer used CLD to create the colorful Twitter language map, and since then further language maps have appeared, e.g. for New York and London. What a multi-lingual world we live in!

Suddenly, just a few weeks ago, I received an out-of-the-blue email from Dick Sites, creator of CLD, with great news: he was finishing up version 2.0 of CLD and had already posted the source code on a new project.

So I've now reworked the Python bindings and ported the unit tests to Python (they pass!) to take advantage of the new features. It was much easier this time around since the CLD2 sources were already pulled out into their own project (thank you Dick and Google!).

There are a number of improvements over the previous version of CLD:
  • Improved accuracy.

  • Upgraded to Unicode 6.2 characters.

  • More languages detected: 83 languages, up from 64 previously.

  • A new "full language table" detector, available in Python as a separate cld2full module, that detects 161 languages. This increases the C library size from 1.8 MB (for 83 languages) to 5.5 MB (for 161 languages). Details are here.

  • An option to identify which parts (byte ranges) of the text contain which language, in case the application needs to do further language-specific processing. From Python, pass the optional returnVectors=True argument to get the byte ranges, but note that this requires additional non-trivial CPU cost. This wiki page shows very interesting statistics on how frequently different languages appear in one page, across top web sites, showing the importance of handling multiple languages in a single text input.

  • A new hintLanguageHTTPHeaders parameter, which you can pass from the Content-Language HTTP header. Also, CLD2 will spot any lang=X attribute inside the <html> tag itself (if you pass it HTML).
In the new Python bindings, I've exposed CLD2's debug* flags, to add verbosity to CLD2's detection process. This document describes how to interpret the resulting output.

The detect function returns up to 3 top detected languages. Each detected language includes the percent of the text that was detected as the language, and a confidence score. The function no longer returns a single "picked" summary language, and the pickSummaryLanguage option has been removed: this option was apparently present for internal backwards compatibility reasons and did not improve accuracy.

Remember that the provided input must be valid UTF-8 bytes, otherwise all sorts of things could go wrong (wrong results, segmentation fault).

To see the list of detected languages, just run this python -c "import cld2; print cld2.DETECTED_LANGUAGES", or python -c "import cld2full; print cld2full.DETECTED_LANGUAGES" to see the full set of languages.

The README gives details on how to build and install CLD2.

Once again, thank you Google, and thank you Dick Sites for making this very useful library available to the world as open-source.

Saturday, June 22, 2013

2X faster PhraseQuery with Lucene using C++ via JNI

I recently described the new lucene-c-boost github project, which provides amazing speedups (up to 7.8X faster) for common Lucene query types using specialized C++ implementations via JNI.

The code works with a stock Lucene 4.3.0 JAR and default codec, and has a trivial API: just call NativeSearch.search instead of IndexSearcher.search.

Now, a quick update: I've optimized PhraseQuery now as well:

TaskQPS base StdDev base QPS opt StdDev opt % change
HighPhrase3.5(2.7%)6.5(0.4%)1.9 X
MedPhrase27.1(1.4%)51.9(0.3%)1.9 X
LowPhrase7.6(1.7%)16.4(0.3%)2.2 X


~2X speedup (~90% - ~119%) is nice!

Again, it's great to see a reduced variance on the runtimes since hotspot is mostly not an issue. It's odd that LowPhrase gets slower QPS than MedPhrase: these queries look mis-labelled (I see the LowPhrase queries getting more hits than MedPhrase!).

All changes have been pushed to lucene-c-boost; next I'd like to figure out how to get facets working.

A new Lucene suggester based on infix matches

Suggest, sometimes called auto-suggest, type-ahead search or auto-complete, is now an essential search feature ever since Google added it almost 5 years ago.

Lucene has a number of implementations; I previously described AnalyzingSuggester. Since then, FuzzySuggester was also added, which extends AnalyzingSuggester by also accepting mis-spelled inputs.

Here I describe our newest suggester: AnalyzingInfixSuggester, now going through iterations on the LUCENE-4845 Jira issue.

Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.

You can see it in action at the example Jira search application that I built to showcase various Lucene features.

For example, if you enter japan you should see various issues suggested, including:
  • SOLR-4945: Japanese Autocomplete and Highlighter broken
  • LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji
  • LUCENE-3921: Add decompose compound Japanese Katakana token capability to Kuromoji
As you can see, the incoming characters can match not just the prefix of each suggestion but also the prefix of any token within.

Unlike the existing suggesters, this new suggester does not use a specialized data-structure such as FSTs. Instead, it's an "ordinary" Lucene index under-the-hood, making use of EdgeNGramTokenFilter to index the short prefixes of each token, up to length 3 by default, for fast prefix querying.

It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time, and at lookup time uses a custom Collector to stop after finding the first N matching hits since these hits are the best matches when sorting by weight. The lookup method lets you specify whether all terms must be found, or any of the terms (Jira search requires all terms).

Since the suggestions are sorted solely by weight, and no other relevance criteria, this suggester is a good fit for applications that have a strong a-priori weighting for each suggestion, such as a movie search engine ranking suggestions by popularity, recency or a blend, for each movie. In Jira search I rank each suggestion (Jira issue) by how recently it was updated.

Specifically, there is no penalty for suggestions with matching tokens far from the beginning, which could mean the relevance is poor in some cases; an alternative approach (patch is on the issue) uses FSTs instead, which can require that the matched tokens are within the first three tokens, for example. This would also be possible with AnalyzingInfixSuggester using an index-time analyzer that dropped all but the first three tokens.

One nice benefit of an index-based approach is AnalyzingInfixSuggester handles highlighting of the matched tokens (red color, above), which has unfortunately proven difficult to provide with the FST-based suggesters. Another benefit is, in theory, the suggester could support near-real-time indexing, but I haven't exposed that in the current patch and probably won't for some time (patches welcome!).

Performance is reasonable: somewhere between AnalyzingSuggester and FuzzySuggester, between 58 - 100 kQPS (details on the issue).

Analysis fun

As with AnalyzingSuggester, AnalyzingInfixSuggester let's you separately configure the index-time vs. search-time analyzers. With Jira search, I enabled stop-word removal at index time, but not at search time, so that a query like or would still successfully find any suggestions containing words starting with or, rather than dropping the term entirely.

Which suggester should you use for your application? Impossible to say! You'll have to test each of Lucene's offerings and pick one. Auto-suggest is an area where one-size-does-not-fit-all, so it's great that Lucene is picking up a number of competing implementations. Whichever you use, please give us feedback so we can further iterate and improve!

Wednesday, June 19, 2013

Screaming fast Lucene searches using C++ via JNI

At the end of the day, when Lucene executes a query, after the initial setup the true hot-spot is usually rather basic code that decodes sequential blocks of integer docIDs, term frequencies and positions, matches them (e.g. taking union or intersection for BooleanQuery), computes a score for each hit and finally saves the hit if it's competitive, during collection.

Even apparently complex queries like FuzzyQuery or WildcardQuery go through a rewrite process that reduces them to much simpler forms like BooleanQuery.

Lucene's hot-spots are so simple that optimizing them by porting them to native C++ (via JNI) was too tempting!

So I did just that, creating the lucene-c-boost github project, and the resulting speedups are exciting:

    
TaskQPS base StdDev base QPS opt StdDev opt % change
AndHighLow469.2(0.9%)316.0(0.7%)0.7 X
Fuzzy163.0(3.3%)62.9(2.0%)1.0 X
Fuzzy225.8(3.1%)37.9(2.3%)1.5 X
AndHighMed50.4(0.7%)110.0(0.9%)2.2 X
OrHighNotLow46.8(5.6%)106.3(1.3%)2.3 X
LowTerm298.6(1.8%)691.4(3.4%)2.3 X
OrHighNotMed34.0(5.3%)89.2(1.3%)2.6 X
OrHighNotHigh5.0(5.7%)14.2(0.8%)2.8 X
Wildcard17.2(1.2%)51.1(9.5%)3.0 X
AndHighHigh21.9(1.0%)69.0(1.0%)3.5 X
OrHighMed18.7(5.7%)59.6(1.1%)3.2 X
OrHighHigh6.7(5.7%)21.5(0.9%)3.2 X
OrHighLow15.7(5.9%)50.8(1.2%)3.2 X
MedTerm69.8(4.2%)243.0(2.2%)3.5 X
OrNotHighHigh13.3(5.7%)46.7(1.4%)3.5 X
OrNotHighMed26.7(5.8%)115.8(2.8%)4.3 X
HighTerm22.4(4.2%)109.2(1.4%)4.9 X
Prefix310.1(1.1%)55.5(3.7%)5.5 X
OrNotHighLow62.9(5.5%)351.7(9.3%)5.6 X
IntNRQ5.0(1.4%)38.7(2.1%)7.8 X


These results are on the full, multi-segment Wikipedia English index with 33.3 M documents. Besides the amazing speedups, it's also nice to see that the variance (StdDev column) is generally lower with the optimized C++ version, because hotspot has (mostly) been taken out of the equation.

The API is easy to use, and works with the default codec so you won't have to re-index just to try it out: instead of IndexSearcher.search, call NativeSearch.search. If the query can be optimized, it will be; otherwise it will seamlessly fallback to IndexSearcher.search. It's fully decoupled from Lucene and works with the stock Lucene 4.3.0 JAR, using Java's reflection APIs to grab the necessary bits.

This is all very new code, and I'm sure there are plenty of exciting bugs, but (after some fun debugging!) all Lucene core tests now pass when using NativeSearch.search.

This is not a C++ port of Lucene

This code is definitely not a general C++ port of Lucene. Rather, it implements a very narrow set of classes, specifically the common query types. The implementations are not general-purpose: they hardwire (specialize) specific code, removing all abstractions like Scorer, DocsEnum, Collector, DocValuesProducer, etc.

There are some major restrictions on when the optimizations will apply:
  • Only tested on Linux and Intel CPU so far

  • Requires Lucene 4.3.x

  • Must use NativeMMapDirectory as your Directory implementation, which maps entire files into RAM (avoids the chunking that the Java-based MMapDirectory must do)

  • Must use the default codec

  • Only sort by score is supported

  • None of the optimized implementations use advance: first, this code is rather complex and will be quite a bit of work to port to C++, and second, queries that benefit from advance are generally quite fast already so we may as well leave them in Java

BooleanQuery is optimized, but only when all clauses are TermQuery against the same field.

C++ is not faster than java!

Not necessarily, anyway: before anyone goes off screaming how these results "prove" Java is so much slower than C++, remember that this is far from a "pure" C++ vs Java test. There are at least these three separate changes mixed in:
  • Algorithmic changes. For example, lucene-c-boost sometimes uses BooleanScorer where Lucene is using BooleanScorer2. Really we need to fix Lucene to do similar algorithmic changes (when they are faster). In particular, all of the OrXX queries that include a Not clause, as well as IntNRQ in the above results, benefit from algorithmic changes.

  • Code specialization: lucene-c-boost implements the searches as big hardwired scary-looking functions, removing all the nice Lucene abstractions. While abstractions are obviously necessary in Lucene, they unfortunately add run-time cost overhead, so removing these abstractions gives some gains.

  • C++ vs java.
It's not at all clear how much of the gains are due to which part; really I need to create the "matching" specialized Java sources to do a more pure test.

This code is dangerous!

Specifically, whenever native C++ code is embedded in Java, there is always the risk of all those fun problems with C++ that we Java developers thought we left behind. For example, if there are bugs (likely!), or even innocent API mis-use by the application such as accidentally closing an IndexReader while other threads are still using it, the process will hit a Segmentation Fault and the OS will destroy the JVM. There may also be memory leaks! And, yes, the C++ sources even use the goto statement.

Work in progress...

This is a work in progress and there are still many ideas to explore. For example, Lucene 4.3.x's default PostingsFormat stores big-endian longs, which means the little-endian Intel CPU must do byte-swapping when decoding each postings block, so one thing to try is a PostingsFormat better optimized for the CPU at search time. Positional queries, Filters and nested BooleanQuery are not yet optimized, as well as certain configurations (e.g., fields that omit norms). Patches welcome!

Nevertheless, initial results are very promising, and if you are willing to risk the dangers in exchange for massive speedups please give it a whirl and report back.

Sunday, June 9, 2013

Build your own finite state transducer

Have you always wanted your very own Lucene finite state transducer (FST) but you couldn't figure out how to use Lucene's crazy APIs?

Then today is your lucky day! I just built a simple web application that creates an FST from the input/output strings that you enter.

If you just want a finite state automaton (no outputs) then enter only inputs, such as this example:



If all of your outputs are non-negative integers then the FST will use numeric outputs, where you sum up the outputs as you traverse a path to get the final output:

Finally, if the outputs are non-numeric then they are treated as strings, in which case you concatenate as you traverse the path:

The red arcs are the ones with the NEXT optimization: these arcs do not store a pointer to a node because their to-node is the very next node in the FST. This is a good optimization: it generally results in a large reduction of the FST size. The bolded arcs tell you the next node is final; this is most interesting when a prefix of another input is accepted, such as this example:



Here the "r" arc is bolded, telling you that "star" is accepted. Furthermore, that node following the "r" arc has a final output, telling you the overall output for "star" is "abc".

The web app is a simple Python WSGI app; source code is here. It invokes a simple Java tool as a subprocess; source code (including generics violations!) is here.

Tuesday, May 21, 2013

Dynamic faceting with Lucene

Lucene's facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I'll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.

To understand these features, and why they are important, we first need a little background. Lucene's facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.

At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count.

This is in contrast to purely dynamic faceting implementations like ElasticSearch's and Solr's, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.

However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr's UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.

Enough background, now on to our two new features!


Sorted-set doc-values faceting

These features bring two dynamic alternatives to the facet module, both computing facet counts from previously indexed doc-values fields. The first feature, sorted-set doc-values faceting (see LUCENE-4795), allows the application to index a normal sorted-set doc-values field, for example:
  doc.add(new SortedSetDocValuesField("foo"));
  doc.add(new SortedSetDocValuesField("bar"));
and then to compute facet counts at search time using SortedSetDocValuesAccumulator and SortedSetDocValuesReaderState.

This feature does not use the taxonomy index, since all state is stored in the doc-values, but the tradeoff is that on each near-real-time reopen, a top-level data-structure is recomputed to map per-segment integer ordinals to global ordinals. The good news is this should be relatively low cost since it's just merge-sorting already sorted terms, and it doesn't need to visit the documents (unlike UnInvertedField).

At search time there is also a small performance hit (~25%, depending on the query) since each per-segment ord must be re-mapped to the global ord space. Likely this could be improved (no time was spend optimizing). Furthermore, this feature currently only works with non-hierarchical facet fields, though this should be fixable (patches welcome!).


Dynamic range faceting

The second new feature, dynamic range faceting, works on top of a numeric doc-values field (see LUCENE-4965), and implements dynamic faceting over numeric ranges. You create a RangeFacetRequest, providing custom ranges with their labels. Each matched document is checked against all ranges and the count is incremented when there is a match. The range-test is a naive simple linear search, which is probably OK since there are usually only a few ranges, but we could eventually upgrade this to an interval tree to get better performance (patches welcome!).

Likewise, this new feature does not use the taxonomy index, only a numeric doc-values field. This feature is especially useful with time-based fields. You can see it in action in the Jira issues search example in the Updated field.

Happy faceting!

Monday, May 13, 2013

Eating dog food with Lucene


Eating your own dog food is important in all walks of life: if you are a chef you should taste your own food; if you are a doctor you should treat yourself when you are sick; if you build houses for a living you should live in a house you built; if you are a parent then try living by the rules that you set for your kids (most parents would fail miserably at this!); and if you build software you should constantly use your own software.

So, for the past few weeks I've been doing exactly that: building a simple Lucene search application, searching all Lucene and Solr Jira issues, and using it instead of Jira's search whenever I need to go find an issue.

It's currently running at jirasearch.mikemccandless.com and it's still quite rough (feedback welcome!).

It's a good showcase of a number of Lucene features:

  • Highlighting using the new PostingsHighlighter; for example, try searching for fuzzy query.

  • Autosuggest, using the not-yet-committed AnalyzingInfixSuggester (LUCENE-4845).

  • Sorting by various fields, including a blended recency and relevance sort.

  • A few synonym examples, for example try searching for oome.

  • Near-real-time searching, and controlled searcher versions: the server uses NRTManager, SearcherLifetimeManager and SearcherManager.

  • ToParentBlockJoinQuery: each issue is indexed as a parent document, and then each comment on the issue is indexed as a separate child document. This allows the server to know which specific comment, along with its metadata, was a match for the query, and if you click on that comment (in the highlighted results) it will take you to that comment in Jira. This is very helpful for mega-issues!

  • Okapi BM25 for ranking.

The drill-downs on the left also show a number of features from Lucene's facet module:
  • Drill sideways for all fields, so that the field does not disappear when you drill down on it.

  • Dynamic range faceting: the Updated drill-down is computed dynamically, e.g. all issues updated in the past week.

  • Hierarchical fields, which are simple since the Lucene facet module supports hierarchy natively. Only the Component dimension is hierarchical, e.g. look at the Component drill down for all Lucene core issues.

  • Multi-select faceting (hold down the shift key when clicking on a value), e.g. all improvements and new features.

  • Multi-valued fields (e.g. User, Fix version, Label).



This is really eating two different dog foods: first, as a developer I see what one must go through to build a search server on top of Lucene's APIs, but second, as an end user, I experience the resulting search user interface whenever I need to find a Lucene or Solr issue. It's like having to eat both wet and dry dog food at once, and both kinds of dog food have uncovered numerous issues!

The issues ranged from outright bugs such as PostingsHighlighter picking the worst snippets instead of the best (LUCENE-4826), to missing features like dynamic numeric range facets (LUCENE-4965), to issues that make consuming Lucene's APIs awkward, especially when mixing different features, such as the difficulty of mixing non-range and range facets with DrillSideways (LUCENE-4980) and the difficulty of using NRTManager with both a taxonomy index and a search index (LUCENE-4967), or finally just inefficient, such as the inability to customize how PostingsHighlighter loads its field values (LUCENE-4846).

The process is far from done! There are still a number of issues that need fixing. For example, it's not easy to mix ToParentBlockJoinQuery and grouping, which is frustrating because fields like who reported an issue, severity, issue status, component would all be natural group-by fields. Some issues, such as the inability of PostingsFormatter to render directly to a JSONObject (LUCENE-4896) are still open because they are challenging to fix cleanly. Others, such as the infix suggester (LUCENE-4845) are in limbo because of disagreements on the best approach, while still others, like BlendedComparator used to sort by mixed relevance and recency, I just haven't pushed back into Lucene yet.

There are plenty of ways to provoke an error from the server; here are two fun examples: try fielded search such as summary:python, or a multi-select drilldown on the Updated field.

Much work remains and I'll keep on eating both wet and dry dog food: it's a very productive way to find problems in Lucene!

Sunday, February 24, 2013

Drill Sideways faceting with Lucene

Lucene's facet module, as I described previously, provides a powerful implementation of faceted search for Lucene. There's been a lot of progress recently, including awesome performance gains as measured by the nightly performance tests we run for Lucene:



That's a nearly 3.8X speedup!

But beyond speedups there are also new features, and here I'll describe the new DrillSideways class being developed under LUCENE-4748.

In the typical faceted search UI, you'll see a column to the left of your search results showing each field (Price, Manufacturer, etc.) enabled for refinement, displaying the unique values or ranges along with their counts. The count tells you how many matches from your current search you'll see if you drill down on that value or range, by clicking it and adding that filter to your current search. If you go shopping for Internal SSDs at Newegg you'll see something like the drill down on the left.

If you click on one of those values at Newegg, say 129GB - 256GB, you'll notice that the field then disappears from the drill down options and jumps to the top of the page, as part of the search breadcrumb, where the only thing you can do with it is drill up by clicking the (X) to remove the filter. What if you drill down on several fields, but then want to explore alternate values for those fields? You can't really do that: you're stuck. This is a poor user experience, and I often find myself making excessive use of the back button to explore different options on sites that only offer drill down and drill up: the UI is too restrictive. Overstock.com is another example of a pure drill down/up UI.

Other sites improve on this by offering drill sideways, or the option to pick alternative or additional facet values for a field you've previously drilled down on.

For example, try searching for an LED television at Amazon, and look at the Brand field, seen in the image to the right: this is a multi-select UI, allowing you to select more than one value. When you select a value (check the box or click on the value), your search is filtered as expected, but this time the field does not disappear: it stays where it was, allowing you to then drill sideways on additional values. Much better!

LinkedIn's faceted search, seen on the left, takes this even further: not only are all fields drill sideways and multi-select, but there is also a text box at the bottom for you to choose a value not shown in the top facet values.

To recap, a single-select field only allows picking one value at a time for filtering, while a multi-select field allows picking more than one. Separately, drilling down means adding a new filter to your search, reducing the number of matching docs. Drilling up means removing an existing filter from your search, expanding the matching documents. Drilling sideways means changing an existing filter in some way, for example picking a different value to filter on (in the single-select case), or adding another or'd value to filter on (in the multi-select case).

Whether the field offers drill sideways is orthogonal to whether the field is multi-select. For example, look at the date field at Search-lucene.com: it's a single-select field, but allows for drill sideways (unfortunately the counts seem to be buggy!).

Implementing Drill Sideways

How can one implement drill sideways? It's tricky because, once you've drilled into a specific value for a given field, the other values in that field will of course have zero drill down count since they were filtered out of the result set. So how do you compute those other counts?

A straightforward solution is the "hold one out" approach: if the original query is "foo", and you've drilled down on three fields (say A:a, B:b and C:c), then to compute the facet count for each of the fields A, B and C you re-run the query, computing drill down counts, each time with only the other fields filtered.

For example, the facet counts for A are the drill down counts from the query +foo +B:b +C:c (A:a is left out), while the facet counts for B are the drill down counts from the query +foo +A:a +C:c (B:b is left out), etc. You must also separately run the query with all filters (+foo +A:a +B:b +C:c) to get the actual hits. With Solr you explicitly specify this hold-on-out yourself as part of the facet request.

While this approach will produce the correct counts, it's costly because now you're running 4 queries instead of 1. You may be able to speed this up by using bitsets instead: produce a bitset of all matching documents for just the query "foo", then a bitset for each of the drill downs A, B and C, and then intersect the bitsets as above, and count facets for the resulting docs. Solr uses an approach like this, caching the bitsets across requests.

Yet another approach, and the one we took on LUCENE-4748, is to execute a single query that matches both hits as well as near-misses, which are documents that matched all but one of the filters. This is actually a standard BooleanQuery, where the original query is a MUST clause, and each filter is a SHOULD clause, with minNumberShouldMatch set to the number of drill down filters minus 1. Then, when each match is collected, if all filters match, it's a true hit and we collect the document accordingly and increment the drill down counts. But if it's a near-miss, because exactly one filter failed, then we instead increment drill sideways counts against that one field that didn't match.

Note that when you first drill down on a given field, the resulting drill sideways counts for that field will be the same as the drill down counts just before you drilled down. If the UI can save this state then a small optimization would be to avoid counting the drill sideways counts for the just-drilled-down field, and instead reuse the previous drill down counts.

The DrillSideways class is obviously very new, so I'm sure there are some exciting bugs in it (though it does have a good randomized test). Feedback and patches are welcome!

Saturday, January 26, 2013

Getting real-time field values in Lucene

We know Lucene's near-real-time search is very fast: you can easily refresh your searcher once per second, even at high indexing rates, so that any change to the index is available for searching or faceting at most one second later. For most applications this is plenty fast.

But what if you sometimes need even better than near-real-time? What if you need to look up truly live or real-time values, so for any document id you can retrieve the very last value indexed?

Just use the newly committed LiveFieldValues class!

It's simple to use: when you instantiate it you provide it with your SearcherManager or NRTManager, so that it can subscribe to the RefreshListener to be notified when new searchers are opened, and then whenever you add, update or delete a document, you notify the LiveFieldValues instance. Finally, call the get method to get the last indexed value for a given document id.

This class is simple inside: it holds the values of recently indexed documents in a ConcurrentHashMap, keyed by the document id, to hold documents that were just indexed but not yet available through the near-real-time searcher. Whenever a new near-real-time searcher is successfully opened, it clears the map of all entries that are now included in that searcher. It carefully handles the transition time from when the reopen started to when it finished by checking two maps for the possible value, and failing that, it falls back to the current searcher.

LiveFieldValues is abstract: you must subclass it and implement the lookupFromSearcher method to retrieve a document's value from an IndexSearcher, since how your application stores the values in the searcher is application dependent (stored fields, doc values or even postings, payloads or term vectors).

Note that this class only offers "live get", i.e. you can get the last indexed value for any document, but it does not offer "live search", i.e. you cannot search against the value until the searcher is reopened. Also, the internal maps are only pruned after a new searcher is opened, so RAM usage will grow unbounded if you never reopen! It's up to your application to ensure that the same document id is never updated simultaneously (in different threads) because in that case you cannot know which update "won" (Lucene does not expose this information, although LUCENE-3424 is one possible solution for this).

An example use-case is to store a version field per document so that you know the last version indexed for a given id; you can then use this to reject a later but out-of-order update for that same document whose version is older than the version already indexed.

LiveFieldValues will be available in the next Lucene release (4.2).

Thursday, January 10, 2013

Taming Text is released!

There's a new exciting book just published from Manning, with the catchy title Taming Text, by Grant S. Ingersoll (fellow Apache Lucene committer), Thomas S. Morton, and Andrew L. Farris.

I enjoyed the (e-)book: it does a good job covering a truly immense topic that could easily have taken several books. Text processing has become vital for businesses to remain competitive in this digital age, with the amount of online unstructured content growing exponentially with time. Yet, text is also a messy and therefore challenging science: the complexities and nuances of human language don't follow a few simple, easily codified rules and are still not fully understood today.

The book describe search techniques, including tokenization, indexing, suggest and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!

N-gram processing (both character and word ngrams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.

The final chapter, "Untamed Text", is especially fun: the sections, some of which are contributed by additional authors, address very challenging topics like semantics extraction, document summarization, relationship extraction, identifying important content and people, detecting emotions with sentiment analysis and cross-language information retrieval.

There were a few topics I expected to see but seemed to be missing. There was no coverage of the Unicode standard (e.g. encodings, and useful standards such as UAX#29 text segmentation). Multi-lingual issues were not addressed; all examples are English. Finite-state transducers were also missing, even though these are powerful tools for text processing. Lucene uses FSTs in a number of places: efficient synonym-filtering, character filtering during analysis, fast auto-suggest, tokenizing Japanese text, in-memory postings format. Still, it's fine that some topics are missing: text processing is an immense field and something has to be cut!

The book is unfortunately based on Lucene/Solr 3.x, so new features only in Lucene/Solr 4.0 are missing, for example the new DirectSpellChecker, scoring models beyond TF/IDF Vector Space Model. Chapter 4, Fuzzy text searching, didn't mention Lucene's new FuzzyQuery nor the very fast Levenshtein Automata approach it uses for finding all fuzzy matches from a large set of terms.

All in all the book is a great introduction to how to leverage numerous open-source tools to process text.