Changing Bits

Wednesday, March 5, 2014

Using Lucene's search server to search Jira issues

You may remember my first blog post describing how the Lucene developers eat our own dog food by using a Lucene search application to find our Jira issues.

That application has become a powerful showcase of a number of modern Lucene features such as drill sideways and dynamic range faceting, a new suggester based on infix matches, postings highlighter, block-join queries so you can jump to a specific issue comment that matched your search, near-real-time indexing and searching, etc. Whenever new users ask me about Lucene's capabilities, I point them to this application so they can see for themselves.

Recently, I've made some further progress so I want to give an update.

The source code for the simple Netty-based Lucene server is now available on this subversion branch (see LUCENE-5376 for details). I've been gradually adding coverage for additional Lucene modules, including facets, suggesters, analysis, queryparsers, highlighting, grouping, joins and expressions. And of course normal indexing and searching! Much remains to be done (there are plenty of nocommits), and the goal here is not to build a feature rich search server but rather to demonstrate how to use Lucene's current modules in a server context with minimal "thin server" additional source code.

Separately, to test this new Lucene based server, and to complete the "dog food," I built a simple Jira search application plugin, to help us find Jira issues, here. This application has various Python tools to extract and index Jira issues using Jira's REST API and a user-interface layer running as a Python WSGI app, to send requests to the server and render responses back to the user. The goal of this Jira search application is to make it simple to point it at any Jira instance / project and enable full searching over all issues.

I just pushed some further changes to the production site:

I upgraded the Jira search application to the current server branch (previously it was running on my private fork).
I switched all analysis components to Lucene's analysis factories; these factories use Java's SPI (Service Provider Interface) so that the server has access to any char filters, tokenizers and token filters in the classpath. This is very helpful when building a server because it means you don't need any special code to handle the great many analysis components that Lucene provides these days. Everything simply passes through the factories (which know how to parse their own arguments).
I've added the Tika project, so you can now find Tika issues as well. This was very simple to add, and seems to be working!
I inserted WordDelimiterFilter so that CamelCaseTokens are split. For example, try searching on infix and note the highlights. As Robert Muir reminded me, WordDelimiterFilter corrupts offsets, which will mess up highlighting in some cases, so I'm going to try to set up ICUTokenizer, which I'm already using, to do this splitting instead.
I switched to Lucene's new expressions module to do blended relevance + recency sort by default when you do a text search, which is helpful because most of the time we are looking for recently touched issues. Previously I used a custom FieldComparator to achieve the same functionality, but expressions is more compact and powerful and lets me remove that custom FieldComparator.
I switched to near-real-time building of the suggestions, using AnalyzingInfixSuggester. Previously I was fully rebuilding the suggester every five minutes, so this saves a lot of CPU since now I just add new Jira issues as they come, and refresh the suggester. It also means a much shorter delay from when an issue is added to when it can be suggested. See LUCENE-5477 for details.
I now commit once per day. Previously I never committed, and simply relied on near-real-time searching. This works just fine, except when I need to bring the server down (e.g. to push new changes out), it required full reindexing, which was very fast but a poor user experience for those users who happened to do a search while it was happening. Now, when I bounce the server it comes back to the last commit and then the near-real-time indexing quickly catches up on any changed issues since that last commit.
Various small issues, such as proper handling when a Jira issue is renamed (the Jira REST API does not make it so easy to discover this!); better production push automation; upgraded to a newer version of bootstrap UI library.

There are still plenty of improvements to make to this Jira search application. For fields with many possible drill-down values, I'd like to have a simple suggester so the user can quickly drill down. I'd like to fix the suggester to filter suggestions according to the project. For example, if you've drilled down into Tika issues, then when you type a new search you should see only Tika issues suggested. For that we need to make AnalyzingInfixSuggester context aware. I'd also like a more compact UI for all of the facet fields; maybe I need to hide the less commonly used facet fields under a "More"...

Please send me any feedback / problems when you're searching for issues!

Thursday, January 23, 2014

Finding long tail suggestions using Lucene's new FreeTextSuggester

Lucene's suggest module offers a number of fun auto-suggest implementations to give a user live search suggestions as they type each character into a search box.

For example, WFSTCompletionLookup compiles all suggestions and their weights into a compact Finite State Transducer, enabling fast prefix lookup for basic suggestions.

AnalyzingSuggester improves on this by using an Analyzer to normalize both the suggestions and the user's query so that trivial differences in whitespace, casing, stop-words, synonyms, as determined by the analyzer, do not prevent a suggestion from matching.

Finally, AnalyzingInfixSuggester goes further by allowing infix matches so that words inside each suggestion (not just the prefix) can trigger a match. You can see this one in action at the Lucene/Solr Jira search application (e.g., try "python") that I recently created to eat our own dog food. It is also the only suggester implementation so far that supports highlighting (this has proven challenging for the other suggesters).
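To make the difference between analyzed prefix matching and infix matching concrete, here is a toy Python sketch. It is not how Lucene implements these suggesters (there are no FSTs or indexes here); the `analyze` function, the stop-word list and both matcher names are assumptions made up for illustration.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of"}

def analyze(text):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def analyzing_prefix_match(query, suggestion):
    """True if the analyzed query is a prefix of the analyzed suggestion.

    All query tokens but the last must match exactly, in order from the
    start of the suggestion; the last one only needs to be a prefix.
    """
    q, s = analyze(query), analyze(suggestion)
    if not q or len(q) > len(s):
        return False
    last = len(q) - 1
    return q[:last] == s[:last] and s[last].startswith(q[last])

def infix_match(query, suggestion):
    """True if the query matches words anywhere inside the suggestion.

    Earlier query tokens must appear as whole tokens; the last query
    token only needs to be a prefix of some token.
    """
    q, s = analyze(query), analyze(suggestion)
    if not q:
        return False
    whole_ok = all(tok in s for tok in q[:-1])
    return whole_ok and any(tok.startswith(q[-1]) for tok in s)
```

With the suggestion "Lucene suggester module", the query "The  Lucene sug" matches as an analyzed prefix despite the casing, extra whitespace and stop word, while the query "suggester" only matches with the infix variant.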

Yet, a common limitation to all of these suggesters is that they can only suggest from a finite set of previously built suggestions. This may not be a problem if your suggestions are past user queries and you have tons and tons of them (e.g., you are Google). Alternatively, if your universe of suggestions is inherently closed, such as the movie and show titles that Netflix's search will suggest, or all product names on an e-commerce site, then a closed set of suggestions is appropriate.

N-Gram language models

For everyone else, where a high percentage of the incoming queries fall into the never-seen-before long tail, Lucene's newest suggester, FreeTextSuggester, can help! It uses the approach described in this Google blog post.

Rather than precisely matching a previous suggestion, it builds up a simple statistical n-gram language model from all suggestions and looks at the last tokens (plus the prefix of whatever final token the user is typing, if present) to predict the most likely next token.

For example, perhaps the user's query so far is: "flashforge 3d p", and because flashforge is an uncommon brand of 3D printer, this particular suggestion prefix was never added to the suggester. Yet, "3d printer" was a frequently seen phrase in other contexts (different brands). In this case, FreeTextSuggester will see "3d" and the "p" prefix for the next token and predict printer, even though "flashforge 3d printer" was never explicitly added as a suggestion.

You specify the order (N) of the model when you create the suggester: larger values of N require more data to train properly but can make more accurate predictions. All lower order models are also built, so if you specify N=3, you will get trigrams, bigrams and unigrams, all compiled into a single weighted FST for maximum sharing of the text tokens. Of course, larger N will create much larger FSTs. In practice N=3 is the highest you should go, unless you have tons of both suggestions to train and RAM to hold the resulting FST.

To handle sparse data, where a given context (the N-1 prior words) was not seen frequently enough to make accurate predictions, the suggester uses the stupid backoff language model (yes, this is really its name, and yes, it performs well!).
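As a rough illustration of the idea (not Lucene's FreeTextSuggester code, which compiles everything into an FST), here is a small pure-Python n-gram predictor with stupid backoff. The class name, the whitespace tokenization and the backoff factor of 0.4 are all assumptions chosen for this sketch.

```python
from collections import Counter

class NGramPredictor:
    """Toy next-token predictor using n-gram counts and stupid backoff."""

    def __init__(self, order=3, alpha=0.4):
        self.order = order
        self.alpha = alpha        # penalty applied each time we back off
        self.counts = Counter()   # n-gram tuple -> count, for n = 1..order
        self.total_unigrams = 0

    def add(self, text):
        """Count all n-grams (n = 1..order) of one suggestion."""
        tokens = text.lower().split()
        self.total_unigrams += len(tokens)
        for n in range(1, self.order + 1):
            for i in range(len(tokens) - n + 1):
                self.counts[tuple(tokens[i:i + n])] += 1

    def score(self, context, token):
        """Stupid backoff score of `token` following the `context` tuple."""
        context = tuple(context[-(self.order - 1):]) if self.order > 1 else ()
        penalty = 1.0
        while context:
            seen = self.counts[context + (token,)]
            if seen:
                return penalty * seen / self.counts[context]
            # Context too sparse: drop its oldest token and penalize.
            context = context[1:]
            penalty *= self.alpha
        return penalty * self.counts[(token,)] / max(self.total_unigrams, 1)

    def suggest(self, typed, num=3):
        """Predict the token being typed from the preceding tokens."""
        tokens = typed.lower().split()
        if tokens and not typed[-1].isspace():
            prefix, context = tokens[-1], tokens[:-1]
        else:
            prefix, context = "", tokens
        candidates = [g[0] for g in self.counts
                      if len(g) == 1 and g[0].startswith(prefix)]
        ranked = sorted(candidates,
                        key=lambda t: (-self.score(tuple(context), t), t))
        return ranked[:num]

predictor = NGramPredictor(order=3)
for s in ["makerbot 3d printer", "cheap 3d printer", "3d printer filament",
          "3d puzzle", "power supply"]:
    predictor.add(s)
top = predictor.suggest("flashforge 3d p")
```

Here "flashforge" was never seen, so the trigram lookup fails and the model backs off to the bigram context "3d", where "printer" is the most frequent continuation starting with "p".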

I expect the best way to use this new FreeTextSuggester will be as a fallback: you would first use one of the existing exact match suggesters, but when those suggesters fail to find any suggestions for a given query, because it's "unusual" and has crossed over into the long tail, you then fall back to FreeTextSuggester.

Google seems to use such a modal approach to suggestions as well: if you type "flashforge 3d p" you should see something like this, where each suggestion covers your entire query so far (indeed, Google has heard of the flashforge brand of 3d printer!):

But then if you keep typing and enter "flashforge 3d printer power u", the suggestions change: instead of suggesting an entire query, matching everything I have typed, Google instead suggests the last word or two:

As usual, this feature is very new and likely to contain exciting bugs! See the Jira issue, LUCENE-5214, for details. If you play with this new suggester please start a discussion on Lucene's user list!

Wednesday, January 8, 2014

Geospatial (distance) faceting using Lucene's dynamic range facets

There have been several recent, quiet improvements to Lucene that, taken together, have made it surprisingly simple to add geospatial distance faceting to any Lucene search application, for example:

  < 1 km (147)
  < 2 km (579)
  < 5 km (2775)

Such distance facets, which allow the user to quickly filter their search results to those that are close to their location, have become especially important lately since most searches are now from mobile smartphones.

In the past, this has been challenging to implement because it's so dynamic and so costly: the facet counts depend on each user's location, and so cannot be cached and shared across users, and the underlying math for spatial distance is complex.

But several recent Lucene improvements now make this surprisingly simple!

First, Lucene's dynamic range faceting has been generalized to accept any ValueSource, not just a numeric doc values field from the index. Thanks to the recently added expressions module, this means you can offer dynamic range facets computed from an arbitrary JavaScript expression, since the expression is compiled on-the-fly to a ValueSource using custom generated Java bytecodes with ASM. Lucene's range faceting is also faster now, using segment trees to quickly assign each value to the matching ranges.

Second, the Haversine distance function was added to the expressions module. The implementation uses impressively fast approximations to the normally costly trigonometric functions, poached in part from the Java Optimized Development Kit project, without sacrificing too much accuracy. It's unlikely the approximations will ever matter in practice, and there is an open issue to further improve the approximation.

Suddenly, armed with these improvements, if you index latitude and longitude as DoubleDocValuesFields in each document, and you know the user's latitude/longitude location for each request, you can easily compute facet counts and offer drill-downs by any set of chosen distances.
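To show what is being computed per hit, here is a plain-Python sketch of Haversine distance plus "less than X km" facet counts. It uses exact trigonometric functions rather than the fast approximations mentioned above, and the Earth radius constant, function names and facet limits are assumptions for illustration, not Lucene APIs.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def distance_facets(user_lat, user_lon, docs, limits_km=(1, 2, 5, 10)):
    """Count how many (lat, lon) docs fall under each distance limit.

    The ranges overlap on purpose: a doc 0.5 km away counts toward
    every limit, just like the "< 1 km / < 2 km / < 5 km" facets.
    """
    counts = {limit: 0 for limit in limits_km}
    for lat, lon in docs:
        d = haversine_km(user_lat, user_lon, lat, lon)
        for limit in limits_km:
            if d < limit:
                counts[limit] += 1
    return counts
```

Because the counts depend on the user's location, they have to be recomputed for every request, which is why the cost of the distance function matters so much.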

First, index your documents with latitude/longitude fields:

Document doc = new Document();
doc.add(new DoubleField("latitude", 40.759011, Field.Store.NO));
doc.add(new DoubleField("longitude", -73.9844722, Field.Store.NO));
writer.addDocument(doc);

At search time, obtain the ValueSource by building a dynamic expression that invokes the Haversine function:

private ValueSource getDistanceValueSource() {
  Expression distance;
  try {
    distance = JavascriptCompiler.compile(
                 "haversin(40.7143528,-74.0059731,latitude,longitude)");
  } catch (ParseException pe) {
    // Should not happen
    throw new RuntimeException(pe);
  }
  SimpleBindings bindings = new SimpleBindings();
  bindings.add(new SortField("latitude", SortField.Type.DOUBLE));
  bindings.add(new SortField("longitude", SortField.Type.DOUBLE));

  return distance.getValueSource(bindings);
}

Instead of the hardwired latitude/longitude above, you should fill in the user's location.

Using that ValueSource, compute the dynamic facet counts like this:

FacetsCollector fc = new FacetsCollector();

searcher.search(new MatchAllDocsQuery(), fc);

Facets facets = new DoubleRangeFacetCounts(
                    "field",
                    getDistanceValueSource(), fc,
                    ONE_KM,
                    TWO_KM,
                    FIVE_KM,
                    TEN_KM);

return facets.getTopChildren(10, "field");

Normally you'd use a "real" query instead of the top-level-browse MatchAllDocsQuery. Finally, once the user picks a distance for drill-down, use the Range.getFilter method and add that to a DrillDownQuery using ConstantScoreQuery:

public TopDocs drillDown(DoubleRange range) throws IOException {
  // Passing no baseQuery means we drill down on all
  // documents ("browse only"):
  DrillDownQuery q = new DrillDownQuery(null);

  q.add("field", new ConstantScoreQuery(
                     range.getFilter(getDistanceValueSource())));

  return searcher.search(q, 10);
}

See the full source code here, from the lucene/demo module.

When I first tested this example, there was a fun bug, and then later the facet APIs were overhauled, so you'll need to wait for the Lucene 4.7 release, or just use the current 4.x sources, to get this example working.

While this example is simple, and works correctly, there are some clear performance improvements that are possible, such as using a bounding box as a fast match to avoid computing Haversine for hits that are clearly outside of the range of possible drill-downs (patches welcome!). Even so, this is a nice step forward for Lucene's faceting and it's amazing that geospatial distance faceting with Lucene can be so simple.
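The bounding-box fast match could look roughly like the following Python sketch. This is only one way to do it, under stated assumptions: the function names and Earth radius are made up for illustration, and the box does not handle wrapping across the ±180° meridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def bounding_box(lat, lon, radius_km):
    """Return (min_lat, max_lat, min_lon, max_lon) in degrees.

    Every point within radius_km of (lat, lon) lies inside the box, so
    hits outside it can be rejected without computing Haversine.
    Caveat: longitude wrap-around at +/-180 degrees is not handled.
    """
    r = radius_km / EARTH_RADIUS_KM          # angular radius in radians
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    min_lat, max_lat = lat_r - r, lat_r + r
    if min_lat > -math.pi / 2 and max_lat < math.pi / 2:
        # Widest longitude offset reached by the circle at this latitude.
        dlon = math.asin(math.sin(r) / math.cos(lat_r))
        min_lon, max_lon = lon_r - dlon, lon_r + dlon
    else:
        # The circle contains a pole: every longitude is possible.
        min_lat = max(min_lat, -math.pi / 2)
        max_lat = min(max_lat, math.pi / 2)
        min_lon, max_lon = -math.pi, math.pi
    return tuple(math.degrees(x) for x in (min_lat, max_lat, min_lon, max_lon))

def in_box(box, lat, lon):
    """Cheap pre-check; only hits passing this need the exact distance."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
```

You would size the box for the largest drill-down distance on offer, and skip the Haversine computation for any hit that falls outside it.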

Thursday, December 12, 2013

Fast range faceting using segment trees and the Java ASM library

In Lucene's facet module we recently added support for dynamic range faceting, to show how many hits match each of a dynamic set of ranges. For example, the Updated drill-down in the Lucene/Solr issue search application uses range facets. Another example is distance facets (< 1 km, < 2 km, etc.), where the distance is dynamically computed based on the user's current location. Price faceting might also use range facets, if the ranges cannot be established during indexing.

To implement range faceting, for each hit, we first calculate the value (the distance, the age, the price) to be aggregated, and then look up which ranges match that value and increment their counts. Today we use a simple linear search through all ranges, which has O(N) cost, where N is the number of ranges.

But this is inefficient!

Segment trees

There are fun data structures like segment trees and interval trees with O(log(N) + M) cost per lookup, where M is the number of ranges that match the given value. I chose to explore segment trees, as Lucene only requires looking up by a single value (interval trees can also efficiently look up all ranges overlapping a provided range) and also because all the ranges are known up front (interval trees also support dynamically adding or removing ranges).

If the ranges will never overlap, you can use a simple binary search; Guava's ImmutableRangeSet takes this approach. However, Lucene's range faceting allows overlapping ranges so we can't do that.

Segment trees are simple to visualize: you "project" all ranges down on top of one another, creating a one-dimensional Venn diagram, to define the elementary intervals. This classifies the entire range of numbers into a minimal number of distinct ranges, each elementary interval, such that all points in each elementary interval always match the same set of input ranges. The lookup process is then a binary search to determine which elementary interval a point belongs to, recording the matched ranges as you recurse down the tree.

Consider these ranges; the lower number is inclusive and the upper number is exclusive:

The elementary intervals (think Venn diagram!) are:

Finally, you build a binary tree on top of the elementary ranges, and then add output range indices to both internal nodes and the leaves of that tree, necessary to prevent adversarial cases that would require too much (O(N^2)) space. During lookup, as you walk down the tree, you gather up the output ranges (indices) you encounter; for our example, each elementary range is assigned the following range indices as outputs:

  -∞ – 0 →
   0 – 10 → 0
  10 – 15 → 1, 2
  15 – 20 → 1, 2, 3
  20 – 30 → 2, 3
  30 – 40 → 3
  40 – 50 → 3, 4
  50 – 70 → 4
  70 – ∞  →

Some ranges correspond to 1 elementary interval, while other ranges correspond to 2 or 3 or more, in general. Some, 2 in this example, may have no matching input ranges.
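Here is a minimal Python sketch of the elementary-interval idea, assuming the five input ranges are [0, 10), [10, 20), [10, 30), [15, 50) and [40, 70), a set chosen here only because it reproduces the table above. It is a simplification: it stores the full list of outputs on every elementary interval (which is exactly the O(N^2) worst case a real segment tree avoids by putting outputs on internal nodes), and the function names are made up for illustration.

```python
import bisect

def build_elementary_intervals(ranges):
    """ranges: list of (min_inclusive, max_exclusive) pairs.

    Returns (boundaries, outputs). Interval 0 is (-inf, boundaries[0]),
    interval k is [boundaries[k-1], boundaries[k]), and the last
    interval is [boundaries[-1], +inf). outputs[k] lists the indices
    of the input ranges that cover interval k.
    """
    boundaries = sorted({b for lo, hi in ranges for b in (lo, hi)})
    outputs = [[] for _ in range(len(boundaries) + 1)]
    for index, (lo, hi) in enumerate(ranges):
        first = bisect.bisect_right(boundaries, lo)
        end = bisect.bisect_right(boundaries, hi)
        for k in range(first, end):
            outputs[k].append(index)
    return boundaries, outputs

def lookup(boundaries, outputs, value):
    """Binary search for the elementary interval, then return its ranges."""
    return outputs[bisect.bisect_right(boundaries, value)]

# Assumed input ranges (see the note above).
RANGES = [(0, 10), (10, 20), (10, 30), (15, 50), (40, 70)]
boundaries, outputs = build_elementary_intervals(RANGES)
```

With these ranges there are eight boundaries and therefore nine elementary intervals, the first and last of which have no outputs.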

Looking up matched ranges

I've pushed all sources described below to a new Google code project; the code is still somewhat rough and exploratory, so there are likely exciting bugs lurking, but it does seem to work: it includes (passing!) tests and simple micro-benchmarks.

I started with a basic segment tree implementation as described on the Wikipedia page, for long values, called SimpleLongRangeMultiSet; here's the recursive lookup method:

  private int lookup(Node node, long v, int[] answers, int upto) {
    if (node.outputs != null) {
      for(int range : node.outputs) {
        answers[upto++] = range;
      }
    }
    if (node.left != null) {
      if (v <= node.left.end) {
        upto = lookup(node.left, v, answers, upto);
      } else {
        upto = lookup(node.right, v, answers, upto);
      }
    }

    return upto;
  }

This worked correctly, but I realized there must be non-trivial overhead for the recursion, checking for nulls, the for loop over the output values, etc. Next, I tried switching to parallel arrays to hold the binary tree (ArrayLongRangeMultiSet), where the left child of node N is at 2*N and the right child is at 2*N+1, but this turned out to be slower.

After that I tested a code specializing implementation, first by creating dynamic Java source code from the binary tree. This eliminates the recursion and creates a single simple method that uses a series of if statements, specialized to the specific ranges, to do the binary search and record the range indices. Here's the resulting specialized code, compiled from the above ranges:

  void lookup(long v, int[] answers) {
    int upto = 0;
    if (v <= 19) {
      if (v <= 9) {
        if (v >= 0) {
          answers[upto++] = 0;
          answers[upto++] = 1;
        }
      } else {
        answers[upto++] = 1;
        answers[upto++] = 2;
        if (v >= 15) {
          answers[upto++] = 3;
        }
      }
    } else {
      if (v <= 39) {
        answers[upto++] = 3;
        if (v <= 29) {
          answers[upto++] = 2;
        }
      } else {
        if (v <= 49) {
          answers[upto++] = 3;
          answers[upto++] = 4;
        } else {
          if (v <= 69) {
            answers[upto++] = 4;
          }
        }
      }
    }
  }

Finally, using the ASM library, I compiled the tree directly to specialized Java bytecode, and this proved to be fastest (up to 2.5X faster in some cases).

As a baseline, I also added the simple linear search method, LinearLongRangeMultiSet; as long as you don't have too many ranges (say 10 or less), its performance is often better than the Java segment tree.

The implementation also allows you to specify the allowed range of input values (for example, maybe all values are >=0 in your usage), which can save an if statement or two in the lookup method.

Counting all matched ranges

While the segment tree allows you to quickly look up all matching ranges for a given value, after a nice tip from fellow Lucene committer Robert Muir, we realized Lucene's range faceting does not need to know the ranges for each value; instead, it only requires the aggregated counts for each range in the end, after seeing many values.

This leads to an optimization: compute counts for each elementary interval and then in the end, roll up those counts to get the count for each range. This will only work for single-valued fields, since for a multi-valued field you'd need to carefully never increment the same range more than once per hit.

So based on that approach, I created a new LongRangeCounter abstract base class, and the SimpleLongRangeCounter Java implementation, and also the ASM specialized version, and the results are indeed faster (~20 to 50%) than using the lookup method to count; I'll use this approach with Lucene.
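The counting optimization can be sketched in a few lines of Python. This is an illustration under assumptions, not the LongRangeCounter code: the class name is made up, the example ranges are hypothetical, and the final roll-up uses prefix sums over the elementary intervals rather than walking a tree.

```python
import bisect

class ElementaryRangeCounter:
    """Count per elementary interval, then roll up to per-range counts.

    Only valid for single-valued fields: each hit contributes exactly
    one value, so it increments exactly one elementary interval.
    """

    def __init__(self, ranges):
        # ranges: list of (min_inclusive, max_exclusive) pairs
        self.ranges = list(ranges)
        self.boundaries = sorted({b for lo, hi in self.ranges for b in (lo, hi)})
        self.interval_counts = [0] * (len(self.boundaries) + 1)

    def add(self, value):
        """One binary search and one increment per hit."""
        self.interval_counts[bisect.bisect_right(self.boundaries, value)] += 1

    def range_counts(self):
        """Roll elementary counts up into one count per input range."""
        prefix = [0]
        for count in self.interval_counts:
            prefix.append(prefix[-1] + count)
        result = []
        for lo, hi in self.ranges:
            first = bisect.bisect_right(self.boundaries, lo)
            end = bisect.bisect_right(self.boundaries, hi)
            result.append(prefix[end] - prefix[first])
        return result

# Hypothetical ranges and values, for illustration only.
counter = ElementaryRangeCounter([(0, 10), (10, 20), (10, 30), (15, 50), (40, 70)])
for v in [5, 12, 17, 17, 45, 100]:
    counter.add(v)
totals = counter.range_counts()
```

The per-hit work no longer depends on how many ranges match the value, which is where the savings over the lookup-based approach come from.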

Segment trees are normally always "perfectly" balanced but one final twist I explored was to use a training set of values to bias the order of the if statements. For example, if your ranges cover a tiny portion of the search space, as is the case for the Updated drill-down, then it should be faster to use a slightly unbalanced tree, by first checking if the value is less than the maximum range. However, in testing, while there are some cases where this "training" is a bit faster, often it's slower; I'm not sure why.

Lucene

I haven't folded this into Lucene yet, but I plan to; all the exploratory code lives in the segment-trees Google code project for now.

Results on the micro-benchmarks can be entirely different once the implementations are folded into a "real" search application. While ASM is a powerful way to generate specialized code, and it gives sizable performance gains at least in the micro-benchmarks, it is an added dependency and complexity for ongoing development and many more developers know Java than ASM. It may also confuse hotspot, causing deoptimizations when there are multiple implementations for an abstract base class. Furthermore, if there are many ranges, the resulting specialized bytecode can become quite large (but, still O(N*log(N)) in size), which may cause other problems. On balance I'm not sure the sizable performance gains (on a micro-benchmark) warrant using ASM in Lucene's range faceting.

Friday, November 29, 2013

Pulling H264 video from an IP camera using Python

IP cameras have come a long way, and recently I upgraded some old cameras to these new Lorex cameras (model LNB2151/LNB2153) and I'm very impressed.

These cameras record 1080p wide-angle video at 30 frames per second, use power over ethernet (PoE), can see when it's dark using builtin infrared LEDs and are weather-proof. The video quality is impressive and they are surprisingly inexpensive. The camera can deliver two streams at once, so you can pull a lower resolution stream for preview, motion detection, etc., and simultaneously pull the higher resolution stream to simply record it for later scrutinizing.

After buying a few of these cameras I needed a simple way to pull the raw H264 video from them, and with some digging I discovered the cameras speak RTSP and RTP which are standard protocols for streaming video and audio from IP cameras. Many IP cameras have adopted these standards.

Both VLC and MPlayer can play RTSP/RTP video streams; for the Lorex cameras the default URL is:

rtsp://admin:000000@<hostname>/PSIA/Streaming/channels/1

After more digging I found the nice open-source (LGPL license) Live555 project, which is a C++ library for all sorts of media related protocols, including RTSP, RTP and RTCP. VLC and MPlayer use this library for their RTSP support. Perfect!

My C++ is a bit rusty, and I really don't understand all of Live555's numerous APIs, but I managed to cobble together a simple Python extension module, derived from Live555's testRTSPClient.cpp example, that seems to work well.

I've posted my current source code in a new Google code project named pylive555. It provides a very simple API (only 3 functions!) to pull frames from a remote camera via RTSP/RTP; Live555 has many, many other APIs that I haven't exposed.

The code is thread-friendly (releases the global interpreter lock when invoking the Live555 APIs).

I've included a simple example.py Python program, that shows how to load H264 video frames from the camera and save them to a local file. You could start from this example and modify it to do other things, for example use the ffmpeg H264 codec to decode individual frames, use a motion detection library to trigger recording, parse each frame's metadata to find the keyframes, etc. Here's the current example.py:

import time
import sys
import live555
import threading

# Shows how to use live555 module to pull frames from an RTSP/RTP
# source.  Run this (likely first customizing the URL below):

# Example: python3 example.py 10.17.4.118 1 10 out.264 
if len(sys.argv) != 5:
  print()
  print('Usage: python3 example.py cameraIP channel seconds fileOut')
  print()
  sys.exit(1)
  
cameraIP = sys.argv[1]
channel = sys.argv[2]
seconds = float(sys.argv[3])
fileOut = sys.argv[4]

# NOTE: the username & password, and the URL path, will vary from one
# camera to another!  This URL path works with the Lorex LNB2153:
url = 'rtsp://admin:000000@%s/PSIA/Streaming/channels/%s' % (cameraIP, channel)

fOut = open(fileOut, 'wb')

def oneFrame(codecName, bytes, sec, usec, durUSec):
  print('frame for %s: %d bytes' % (codecName, len(bytes)))
  fOut.write(b'\0\0\0\1' + bytes)

# Starts pulling frames from the URL, with the provided callback:
useTCP = False
live555.startRTSP(url, oneFrame, useTCP)

# Run Live555's event loop in a background thread:
t = threading.Thread(target=live555.runEventLoop, args=())
t.daemon = True
t.start()

endTime = time.time() + seconds
while time.time() < endTime:
  time.sleep(0.1)

# Tell Live555's event loop to stop:
live555.stopEventLoop()

# Wait for the background thread to finish:
t.join()

# Close the output file so all buffered frames are flushed to disk:
fOut.close()

Installation is very easy; see the README.txt. I've only tested on Linux with Python 3.2 and with the Lorex LNB2151 cameras.
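As a starting point for finding keyframes, here is a small Python sketch that inspects the H.264 NAL unit header. It assumes each frame handed to the callback is a single NAL unit without a start code (which is consistent with the example prepending one itself); the function names are made up for illustration and are not part of pylive555.

```python
# A few common H.264 NAL unit types (low 5 bits of the first byte).
NAL_TYPE_NAMES = {
    1: "non-IDR slice",
    5: "IDR slice (keyframe)",
    6: "SEI",
    7: "SPS",
    8: "PPS",
}

def nal_unit_type(frame):
    """Return the NAL unit type of one frame (a bytes object)."""
    if not frame:
        raise ValueError("empty frame")
    return frame[0] & 0x1F

def is_keyframe(frame):
    """True if the frame is an IDR slice, i.e. decoding can start here."""
    return nal_unit_type(frame) == 5

def describe(frame):
    """Human-readable name for logging from a frame callback."""
    t = nal_unit_type(frame)
    return NAL_TYPE_NAMES.get(t, "other (%d)" % t)
```

Inside a callback like oneFrame you could call is_keyframe(bytes) to decide, for example, where to start a new recording file.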

I'm planning on installing one of these Lorex cameras inside a bat house that I'll build with the kids this winter. If we're lucky we'll be able to view live bats in the summer!

Tuesday, November 12, 2013

Playing a sound (AIFF) file from Python using PySDL2

Sometimes you need to play sounds or music (digitized samples) from Python, which really ought to be a simple task. Yet it took me a little while to work out, and the resulting source code is quite simple, so I figured I'd share it here in case anybody else is struggling with it.

The Python wiki lists quite a few packages for working with audio, but most of them are overkill for basic audio recording and playback.

For quite some time I had been using PyAudio, which adds Python bindings to the PortAudio project. I really like it because it focuses entirely on recording and playing audio. But, for some reason, when I recently upgraded to Mavericks, it stutters whenever I try to play samples at a sample rate lower than 44.1 kHz. I've emailed the author to try to get to the bottom of it.

In the meantime, I tried a new package, PySDL2, which adds Python bindings to the SDL2 (Simple DirectMedia Layer) project.

SDL2 does quite a bit more than basic audio, and I didn't dig into any of that yet. I hit one small issue with PySDL2, but the one-line change in the issue fixes it. Here's the resulting code:

import sdl2
import sys
import aifc
import threading

class ReadAIFF:
  def __init__(self, fileName):
    self.a = aifc.open(fileName)
    self.frameUpto = 0
    self.bytesPerFrame = self.a.getnchannels() * self.a.getsampwidth()
    self.numFrames = self.a.getnframes()
    self.done = threading.Event()
    
  def playNextChunk(self, unused, buf, bufSize):
    # Integer division, so the frame count stays an int on Python 3 too:
    framesInBuffer = bufSize // self.bytesPerFrame
    framesToRead = min(framesInBuffer, self.numFrames-self.frameUpto)

    if self.frameUpto == self.numFrames:
      self.done.set()

    # TODO: is there a faster way to copy the string into the ctypes
    # pointer/array?
    # bytearray yields ints on both Python 2 and 3, so no ord() is needed:
    for i, b in enumerate(bytearray(self.a.readframes(framesToRead))):
      buf[i] = b

    # Play silence after:
    # TODO: is there a faster way to zero out the array?
    for i in range(self.bytesPerFrame*framesToRead, self.bytesPerFrame*framesInBuffer):
      buf[i] = 0

    self.frameUpto += framesToRead

if sdl2.SDL_Init(sdl2.SDL_INIT_AUDIO) != 0:
  raise RuntimeError('failed to init audio')

p = ReadAIFF(sys.argv[1])
spec = sdl2.SDL_AudioSpec(p.a.getframerate(),
                          sdl2.AUDIO_S16MSB,
                          p.a.getnchannels(),
                          512,
                          sdl2.SDL_AudioCallback(p.playNextChunk))

# TODO: instead of passing None for the 4th arg, I really should pass
# another AudioSpec and then confirm it matched what I asked for:
devID = sdl2.SDL_OpenAudioDevice(None, 0, spec, None, 0)
if devID == 0:
  raise RuntimeError('failed to open audio device')

# Tell audio device to start playing:
sdl2.SDL_PauseAudioDevice(devID, 0)

# Wait until all samples are done playing
p.done.wait()

sdl2.SDL_CloseAudioDevice(devID)

The code is straightforward: it loads an AIFF file, using Python's builtin aifc module, and then creates a callback, playNextChunk, which is invoked by PySDL2 when it needs more samples to play. So far it seems to work very well!
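On the two TODOs in the listing about faster copying and zeroing: one option is the standard library's ctypes.memmove and ctypes.memset, which operate on the whole block at once instead of byte by byte. The helper below is a sketch under the assumption that the callback's buf argument is a ctypes pointer to unsigned bytes; I have only exercised it against a plain ctypes array, not against PySDL2 itself, and the function name is made up.

```python
import ctypes

def fill_audio_buffer(buf, buf_size, data):
    """Copy `data` (bytes) into `buf`, then zero the rest (silence).

    buf is assumed to be a ctypes pointer to at least buf_size bytes.
    """
    n = min(len(data), buf_size)
    # Bulk copy of the sample bytes into the audio buffer:
    ctypes.memmove(buf, data, n)
    if n < buf_size:
        # Zero the tail, addressing it by raw address plus offset:
        base = ctypes.cast(buf, ctypes.c_void_p).value
        ctypes.memset(base + n, 0, buf_size - n)

# Stand-in for the buffer SDL would pass to the callback:
storage = (ctypes.c_ubyte * 8)(*([255] * 8))
pointer = ctypes.cast(storage, ctypes.POINTER(ctypes.c_ubyte))
fill_audio_buffer(pointer, 8, b"\x01\x02\x03")
```

In playNextChunk this would replace both for loops with a single call, passing the result of readframes as data.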

Saturday, September 28, 2013

Lucene now has an in-memory terms dictionary, thanks to Google Summer of Code

Last year, Han Jiang's Google Summer of Code project was a big success: he created a new (now, default) postings format for substantially faster searches, along with smaller indices.

This summer, Han was at it again, with a new Google Summer of Code project with Lucene: he created a new terms dictionary holding all terms and their metadata in memory as an FST.

In fact, he created two new terms dictionary implementations. The first, FSTTermsWriter/Reader, holds all terms and metadata in a single in-memory FST, while the second, FSTOrdTermsWriter/Reader, does the same but also supports retrieving the ordinal for a term (TermsEnum.ord()) and looking up a term given its ordinal (TermsEnum.seekExact(long ord)). The second one also uses this ord internally so that the FST is more compact, while all metadata is stored outside of the FST, referenced by ord.

Like the default BlockTree terms dictionary, these new terms dictionaries accept any PostingsBaseFormat so you can separately plug in whichever format you want to encode/decode the postings.

Han also improved the PostingsBaseFormat API so that there is now a cleaner separation of how terms and their metadata are encoded vs. how postings are encoded; PostingsWriterBase.encodeTerm and PostingsReaderBase.decodeTerm now handle encoding and decoding any term metadata required by the postings format, abstracting away how the long[]/byte[] were persisted by the terms dictionary. Previously this line was annoyingly blurry.

Unfortunately, while the performance for primary key lookups is substantially faster, other queries, e.g. WildcardQuery, are slower; see LUCENE-3069 for details. Fortunately, using PerFieldPostingsFormat, you are free to pick and choose which fields (e.g. your "id" field) should use the new terms dictionary.

For now this feature is trunk-only (eventually Lucene 5.0).

Thank you Han and thank you Google!

Monday, September 16, 2013

Three exciting Lucene features in one day

Yesterday was a productive day: suddenly, there are three exciting new features coming to Lucene.

Expressions module

The first feature, committed yesterday, is the new expressions module. This allows you to define a dynamic field for sorting, using an arbitrary String expression. There is builtin support for parsing JavaScript, but the parser is pluggable if you want to create your own syntax.

For example, you could define a sort field using the expression

  sqrt(_score) + ln(popularity)

if you want to offer a blended sort primarily by relevance and boosting by a popularity field.
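
As a rough illustration of what such a blended sort computes, here is a small Python sketch. It does not use the expressions module; it just evaluates the same formula over a few hypothetical hits (the doc ids, scores and popularity values are made up).

```python
import math


def blended_score(score, popularity):
    # Mirrors the expression sqrt(_score) + ln(popularity).
    return math.sqrt(score) + math.log(popularity)


# Hypothetical hits: (doc id, relevance score, popularity)
hits = [("a", 4.0, 1.0), ("b", 1.0, 100.0), ("c", 9.0, 2.0)]

# Sort by the blended value, highest first.
ranked = sorted(hits, key=lambda h: blended_score(h[1], h[2]), reverse=True)
ranked_ids = [doc_id for doc_id, _, _ in ranked]
```

Here the very popular but weakly matching document "b" ends up first, which shows how strongly the popularity term can move the ordering.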

The code is very easy to use; there are some nice examples in the TestDemoExpressions.java unit test case, and this will be available in Lucene's next stable release (4.6).

Updateable numeric doc-values fields

The second feature, also committed yesterday, is updateable numeric doc-values fields, letting you change previously indexed numeric values using the new updateNumericDocValue method on IndexWriter. It works fine with near-real-time readers, so you can update the numeric values for a few documents and then re-open a new near-real-time reader to see the changes.

The feature is currently trunk only as we work out a few remaining issues involving a particularly controversial boolean. It also currently does not work on sparse fields, i.e. you can only update a document's value if that document had already indexed that field in the first place.

Combined, these two features enable powerful use-cases where you want to sort by a blended field that is changing over time. For example, perhaps you measure how often your users click through each document in the search results, and then use that to update the popularity field, which is then used for a blended sort. This way the rankings of the search results change over time as you learn from users which documents are popular and which are not.

Of course such a feature was always possible before, using custom external code, but with both expressions and updateable doc-values now available it becomes trivial to implement!
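
The feedback loop can be sketched in a few lines of Python. This is only a simulation under stated assumptions: two dicts stand in for the indexed relevance scores and the popularity doc-values field, and record_click is a hypothetical helper, not a Lucene call.

```python
import math

# Hypothetical in-memory stand-ins for relevance scores and a popularity
# doc-values field; in Lucene the update would go through IndexWriter.
relevance = {"doc1": 9.0, "doc2": 4.0}
popularity = {"doc1": 1, "doc2": 1}


def record_click(doc_id):
    # Update only the numeric value; nothing else about the document changes.
    popularity[doc_id] += 1


def rank():
    # Blended sort: sqrt(relevance) + ln(popularity), highest first.
    def blended(doc_id):
        return math.sqrt(relevance[doc_id]) + math.log(popularity[doc_id])
    return sorted(relevance, key=blended, reverse=True)


before = rank()
for _ in range(9):
    record_click("doc2")
after = rank()
```

Before any clicks doc1 ranks first on relevance alone; after nine clicks on doc2 the order flips.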

Free text suggestions

Finally, the third feature is a new suggester implementation, FreeTextSuggester. It is a very different suggester than the existing ones: rather than suggest from a finite universe of pre-built suggestions, it uses a simple ngram language model to predict the "long tail" of possible suggestions based on the 1 or 2 previous tokens.

Under the hood, it uses ShingleFilter to create the ngrams, and an FST to store and look up the resulting ngram models. While multiple ngram models are stored compactly in a single FST, the FST can still get quite large; the 3-gram, 2-gram and 1-gram model built on the AOL query logs is 19.4 MB (the queries themselves are 25.4 MB). This was inspired by Google's approach.
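
The general idea of predicting from the previous 1 or 2 tokens can be sketched in Python as follows. This is a simplified illustration, not FreeTextSuggester's algorithm: plain dicts replace the FST, the backoff is a naive "try the longest context first" rule, and the sample queries are made up.

```python
from collections import Counter, defaultdict


def build_model(queries, max_n=3):
    """Count which token follows each context of 0..max_n-1 previous tokens."""
    model = defaultdict(Counter)
    for q in queries:
        tokens = q.split()
        for i, tok in enumerate(tokens):
            for n in range(max_n):
                if i - n < 0:
                    break
                context = tuple(tokens[i - n:i])
                model[context][tok] += 1
    return model


def suggest(model, text, k=3):
    """Predict next tokens from the last 2 tokens, backing off to shorter contexts."""
    tokens = text.split()
    for n in (2, 1, 0):
        if n > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - n:])
        candidates = model.get(context)
        if candidates:
            return [tok for tok, _ in candidates.most_common(k)]
    return []


model = build_model(["the fast and the furious", "the fast lane", "fast and loose"])
```

With this toy model, "the fast and the" is completed from the 2-token context, while an unseen context such as "slow fast" backs off to what usually follows "fast".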

Likely this suggester would not be used by itself, but rather as a fallback when your primary suggester failed to find any suggestions; you can see this behavior with Google. Try searching for "the fast and the ", and you will see the suggestions are still full queries. But if the next word you type is "burning" then suddenly Google (so far!) does not have a full suggestion and falls back to their free text approach.

Wednesday, August 14, 2013

SuggestStopFilter carefully removes stop words for suggesters

Lucene now has a nice set of suggesters that use an analyzer to tokenize the suggestions: AnalyzingSuggester, FuzzySuggester and AnalyzingInfixSuggester. Using an analyzer is powerful because it lets you customize exactly how suggestions are matched: you can normalize case, apply stemming, match across different synonym forms, etc.

One of the most common things you'll do with your analyzer is to remove stop-words using StopFilter. Unfortunately, if you try this, you'll quickly notice that the stop filter is too aggressive because it happily removes the last token even if the user isn't done typing it yet. For example if the user has typed "a", you'd expect suggestions like apple, aardvark, etc., but you won't get that because StopFilter removed the "a" token.

You could try using StopFilter only while indexing, which was my first attempt with the suggestions at jirasearch.mikemccandless.com, but then, at least for AnalyzingInfixSuggester, you'll fail to get matches when you pass allTermsRequired=true because the suggester then requires that even stop words find matches.

Finally, you could use the new SuggestStopFilter at lookup time: this filter is just like StopFilter except when the token is the very last token, it checks the offset for that token and if the offset indicates that the token has ended without any further non-token characters, then the token is preserved. The token is also marked as a keyword, so that any later stem filters won't change it. This way a query "a" can find "apple", but a query "a " (with a trailing space) will find nothing because the "a" will be removed.
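
The offset check is easy to sketch outside Lucene. The Python function below is a hypothetical approximation: it tokenizes with a regular expression, uses a made-up stop word list, and leaves out the keyword marking, but it keeps a trailing stop word only when the input ends exactly at the end of that token.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "or"}


def suggest_stop_filter(text, stop_words=STOP_WORDS):
    """Drop stop words, but keep the last token if the user may still be typing it."""
    lowered = text.lower()
    matches = list(re.finditer(r"\w+", lowered))
    kept = []
    for i, m in enumerate(matches):
        is_last = i == len(matches) - 1
        # The last token is unfinished when its end offset is the end of the input.
        still_typing = is_last and m.end() == len(lowered)
        if m.group() in stop_words and not still_typing:
            continue
        kept.append(m.group())
    return kept
```

With this sketch, "a" keeps its only token, while "a " (trailing space) produces no tokens at all.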

I've pushed SuggestStopFilter to jirasearch.mikemccandless.com and it seems to be working well so far!

Friday, August 2, 2013

A new version of the Compact Language Detector

It's been almost two years since I originally factored out the fast and accurate Compact Language Detector from the Chromium project, and the effort was clearly worthwhile: the project is popular and others have created additional bindings for languages including at least Perl, Ruby, R, JavaScript, PHP and C#/.NET.

Eric Fischer used CLD to create the colorful Twitter language map, and since then further language maps have appeared, e.g. for New York and London. What a multi-lingual world we live in!

Suddenly, just a few weeks ago, I received an out-of-the-blue email from Dick Sites, creator of CLD, with great news: he was finishing up version 2.0 of CLD and had already posted the source code on a new project.

So I've now reworked the Python bindings and ported the unit tests to Python (they pass!) to take advantage of the new features. It was much easier this time around since the CLD2 sources were already pulled out into their own project (thank you Dick and Google!).

There are a number of improvements over the previous version of CLD:

Improved accuracy.
Upgraded to Unicode 6.2 characters.
More languages detected: 83 languages, up from 64 previously.
A new "full language table" detector, available in Python as a separate cld2full module, that detects 161 languages. This increases the C library size from 1.8 MB (for 83 languages) to 5.5 MB (for 161 languages). Details are here.
An option to identify which parts (byte ranges) of the text contain which language, in case the application needs to do further language-specific processing. From Python, pass the optional returnVectors=True argument to get the byte ranges, but note that this incurs additional non-trivial CPU cost. This wiki page shows very interesting statistics on how frequently different languages appear in one page, across top web sites, showing the importance of handling multiple languages in a single text input.
A new hintLanguageHTTPHeaders parameter, which you can pass from the Content-Language HTTP header. Also, CLD2 will spot any lang=X attribute inside the <html> tag itself (if you pass it HTML).

In the new Python bindings, I've exposed CLD2's debug* flags, to add verbosity to CLD2's detection process. This document describes how to interpret the resulting output.

The detect function returns up to 3 top detected languages. Each detected language includes the percent of the text that was detected as the language, and a confidence score. The function no longer returns a single "picked" summary language, and the pickSummaryLanguage option has been removed: this option was apparently present for internal backwards compatibility reasons and did not improve accuracy.

Remember that the provided input must be valid UTF-8 bytes, otherwise all sorts of things could go wrong (wrong results, segmentation fault).
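
One cheap way to guard against that is to validate the bytes before handing them to the detector. The helper below is a small sketch using only the Python standard library; is_valid_utf8 is a hypothetical name and not part of the CLD2 bindings.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Input that fails the check can be rejected or re-encoded first, rather than risking wrong results or a crash in the native code.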

To see the list of detected languages, just run this:

  python -c "import cld2; print cld2.DETECTED_LANGUAGES"

or, to see the full set of languages:

  python -c "import cld2full; print cld2full.DETECTED_LANGUAGES"

The README gives details on how to build and install CLD2.

Once again, thank you Google, and thank you Dick Sites for making this very useful library available to the world as open-source.

Saturday, June 22, 2013

2X faster PhraseQuery with Lucene using C++ via JNI

I recently described the new lucene-c-boost github project, which provides amazing speedups (up to 7.8X faster) for common Lucene query types using specialized C++ implementations via JNI.

The code works with a stock Lucene 4.3.0 JAR and default codec, and has a trivial API: just call NativeSearch.search instead of IndexSearcher.search.

Now, a quick update: I've optimized PhraseQuery now as well:

Task	QPS base	StdDev base	QPS opt	StdDev opt	Speedup
HighPhrase	3.5	(2.7%)	6.5	(0.4%)	1.9 X
MedPhrase	27.1	(1.4%)	51.9	(0.3%)	1.9 X
LowPhrase	7.6	(1.7%)	16.4	(0.3%)	2.2 X

~2X speedup (~90% - ~119%) is nice!
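
For reference, the speedup factors can be recomputed from the QPS columns with a few lines of Python. The table's QPS values are rounded to one decimal, so percentages derived this way can differ slightly from the figures quoted in the text.

```python
# QPS figures copied from the table above: task -> (base QPS, optimized QPS)
results = {
    "HighPhrase": (3.5, 6.5),
    "MedPhrase": (27.1, 51.9),
    "LowPhrase": (7.6, 16.4),
}

# Speedup is simply optimized QPS divided by base QPS.
speedups = {task: opt / base for task, (base, opt) in results.items()}

for task, speedup in speedups.items():
    pct_gain = (speedup - 1) * 100
    print("%-10s %.1f X  (+%.0f%%)" % (task, speedup, pct_gain))
```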

Again, it's great to see a reduced variance on the runtimes since hotspot is mostly not an issue. It's odd that LowPhrase gets lower QPS than MedPhrase: these queries look mis-labelled (I see the LowPhrase queries getting more hits than MedPhrase!).

All changes have been pushed to lucene-c-boost; next I'd like to figure out how to get facets working.

A new Lucene suggester based on infix matches

Suggest, sometimes called auto-suggest, type-ahead search or auto-complete, has been an essential search feature ever since Google added it almost 5 years ago.

Lucene has a number of implementations; I previously described AnalyzingSuggester. Since then, FuzzySuggester was also added, which extends AnalyzingSuggester by also accepting mis-spelled inputs.

Here I describe our newest suggester: AnalyzingInfixSuggester, now going through iterations on the LUCENE-4845 Jira issue.

Unlike the existing suggesters, which generally find suggestions whose whole prefix matches the current user input, this suggester will find matches of tokens anywhere in the user input and in the suggestion; this is why it has Infix in its name.

You can see it in action at the example Jira search application that I built to showcase various Lucene features.

For example, if you enter japan you should see various issues suggested, including:

SOLR-4945: Japanese Autocomplete and Highlighter broken
LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji
LUCENE-3921: Add decompose compound Japanese Katakana token capability to Kuromoji

As you can see, the incoming characters can match not just the prefix of each suggestion but also the prefix of any token within.

Unlike the existing suggesters, this new suggester does not use a specialized data-structure such as FSTs. Instead, it's an "ordinary" Lucene index under-the-hood, making use of EdgeNGramTokenFilter to index the short prefixes of each token, up to length 3 by default, for fast prefix querying.
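
To show what indexing short prefixes looks like, here is a minimal Python sketch. It is an illustration only: a dict of sets stands in for the Lucene index, whitespace splitting stands in for the analyzer, and the function names are hypothetical. The two sample titles are taken from the suggestions listed above.

```python
def edge_ngrams(token, max_len=3):
    """Return the prefixes of token with lengths 1..max_len."""
    return [token[:n] for n in range(1, min(max_len, len(token)) + 1)]


def index_suggestion(index, suggestion_id, text, max_len=3):
    """Add every token and its short prefixes to a toy inverted index (dict of sets)."""
    for token in text.lower().split():
        for term in edge_ngrams(token, max_len) + [token]:
            index.setdefault(term, set()).add(suggestion_id)


index = {}
index_suggestion(index, "LUCENE-3922", "Add Japanese Kanji number normalization to Kuromoji")
index_suggestion(index, "SOLR-4945", "Japanese Autocomplete and Highlighter broken")
```

A short input such as "jap" then becomes a single exact term lookup, index["jap"], rather than a scan over all terms that start with those letters.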

It also uses the new index sorter APIs to pre-sort all postings by suggested weight at index time, and at lookup time uses a custom Collector to stop after finding the first N matching hits since these hits are the best matches when sorting by weight. The lookup method lets you specify whether all terms must be found, or any of the terms (Jira search requires all terms).
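
The early-termination trick can be sketched in Python as follows. This is a simplified model, not the suggester's code: a list sorted by descending weight stands in for the sorted index, a linear scan stands in for the postings, and the suggestions and weights are made up.

```python
def lookup(suggestions, query, num=2, all_terms_required=True):
    """Return the first `num` matches from a list pre-sorted by descending weight."""
    prefixes = query.lower().split()
    combine = all if all_terms_required else any
    results = []
    for text, weight in suggestions:
        tokens = text.lower().split()
        # Each query term must be (or may be) a prefix of some token in the suggestion.
        if combine(any(t.startswith(p) for t in tokens) for p in prefixes):
            results.append(text)
            if len(results) == num:
                break  # safe to stop: every later entry has a lower weight
    return results


# Hypothetical suggestions, sorted once by weight at "index time".
suggestions = sorted(
    [("fix japanese tokenizer", 5), ("japanese number filter", 9),
     ("number parsing bug", 7), ("add kanji support", 1)],
    key=lambda s: s[1], reverse=True)
```

Because the entries are visited in weight order, the first N matches are already the top N, so no scoring or sorting is needed at lookup time.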

Since the suggestions are sorted solely by weight, and no other relevance criteria, this suggester is a good fit for applications that have a strong a-priori weighting for each suggestion, such as a movie search engine ranking suggestions by popularity, recency or a blend, for each movie. In Jira search I rank each suggestion (Jira issue) by how recently it was updated.

Specifically, there is no penalty for suggestions with matching tokens far from the beginning, which could mean the relevance is poor in some cases; an alternative approach (patch is on the issue) uses FSTs instead, which can require that the matched tokens are within the first three tokens, for example. This would also be possible with AnalyzingInfixSuggester using an index-time analyzer that dropped all but the first three tokens.

One nice benefit of an index-based approach is that AnalyzingInfixSuggester handles highlighting of the matched tokens (red color, above), which has unfortunately proven difficult to provide with the FST-based suggesters. Another benefit is, in theory, the suggester could support near-real-time indexing, but I haven't exposed that in the current patch and probably won't for some time (patches welcome!).

Performance is reasonable: somewhere between AnalyzingSuggester and FuzzySuggester, between 58 and 100 kQPS (details on the issue).

Analysis fun

As with AnalyzingSuggester, AnalyzingInfixSuggester lets you separately configure the index-time vs. search-time analyzers. With Jira search, I enabled stop-word removal at index time, but not at search time, so that a query like or would still successfully find any suggestions containing words starting with or, rather than dropping the term entirely.

Which suggester should you use for your application? Impossible to say! You'll have to test each of Lucene's offerings and pick one. Auto-suggest is an area where one-size-does-not-fit-all, so it's great that Lucene is picking up a number of competing implementations. Whichever you use, please give us feedback so we can further iterate and improve!