Wednesday, January 8, 2014

Geospatial (distance) faceting using Lucene's dynamic range facets

There have been several recent, quiet improvements to Lucene that, taken together, have made it surprisingly simple to add geospatial distance faceting to any Lucene search application, for example:
  < 1 km (147)
  < 2 km (579)
  < 5 km (2775)
Such distance facets, which allow the user to quickly filter their search results to those that are close to their location, has become especially important lately since most searches are now from mobile smartphones.

In the past, this has been challenging to implement because it's so dynamic and so costly: the facet counts depend on each user's location, and so cannot be cached and shared across users, and the underlying math for spatial distance is complex.

But several recent Lucene improvements now make this surprisingly simple!

First, Lucene's dynamic range faceting has been generalized to accept any ValueSource, not just a numeric doc values field from the index. Thanks to the recently added expressions module, this means you can offer dynamic range facets computed from an arbitrary JavaScript expression, since the expression is compiled on-the-fly to a ValueSource using custom generated Java bytecodes with ASM. Lucene's range faceting is also faster now, using segment trees to quickly assign each value to the matching ranges.

Second, the Haversine distance function was added to the expressions module. The implementation uses impressively fast approximations to the normally costly trigonometric functions, poached in part from the Java Optimized Development Kit project, without sacrificing too much accuracy. It's unlikely the approximations will ever matter in practice, and there is an open issue to further improve the approximation.

Suddenly, armed with these improvements, if you index latitude and longitude as DoubleDocValuesFields in each document, and you know the user's latitude/longitude location for each request, you can easily compute facet counts and offer drill-downs by any set of chosen distances.

First, index your documents with latitude/longitude fields:
1
2
3
4
Document doc = new Document();
doc.add(new DoubleField("latitude", 40.759011, Field.Store.NO));
doc.add(new DoubleField("longitude", -73.9844722, Field.Store.NO));
writer.addDocument(doc);
At search time, obtain the ValueSource by building a dynamic expression that invokes the Haversine function:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
private ValueSource getDistanceValueSource() {
  Expression distance;
  try {
    distance = JavascriptCompiler.compile(
                 "haversin(40.7143528,-74.0059731,latitude,longitude)");
  } catch (ParseException pe) {
    // Should not happen
    throw new RuntimeException(pe);
  }
  SimpleBindings bindings = new SimpleBindings();
  bindings.add(new SortField("latitude", SortField.Type.DOUBLE));
  bindings.add(new SortField("longitude", SortField.Type.DOUBLE));

  return distance.getValueSource(bindings);
}
Instead of the hardwired latitude/longitude above, you should fill in the user's location.

Using that ValueSource, compute the dynamic facet counts like this:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
FacetsCollector fc = new FacetsCollector();

searcher.search(new MatchAllDocsQuery(), fc);

Facets facets = new DoubleRangeFacetCounts(
                    "field",
                    getDistanceValueSource(), fc,
                    ONE_KM,
                    TWO_KM,
                    FIVE_KM,
                    TEN_KM);

return facets.getTopChildren(10, "field");
Normally you'd use a "real" query instead of the top-level-browse MatchAllDocsQuery. Finally, once the user picks a distance for drill-down, use the Range.getFilter method and add that to a DrillDownQuery using ConstantScoreQuery:
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
public TopDocs drillDown(DoubleRange range) throws IOException {
  // Passing no baseQuery means we drill down on all
  // documents ("browse only"):
  DrillDownQuery q = new DrillDownQuery(null);

  q.add("field", new ConstantScoreQuery(
                     range.getFilter(getDistanceValueSource())));

  return searcher.search(q, 10);
}
See the full source code here, from the lucene/demo module.

When I first tested this example, there was a fun bug, and then later the facet APIs were overhauled, so you'll need to wait for the Lucene 4.7 release, or just use the current the 4.x sources, to get this example working.

While this example is simple, and works correctly, there are some clear performance improvements that are possible, such as using a bounding box as a fast match to avoid computing Haversine for hits that are clearly outside of the range of possible drill-downs (patches welcome!). Even so, this is a nice step forward for Lucene's faceting and it's amazing that geospatial distance faceting with Lucene can be so simple.

16 comments:

  1. Hey Mike, so quick clarification on the indexing portion? I thought from reading the faceted filtering doc that you had to index facets in a separate "taxonomy" index which is then what is used for faceted filtering and faceted counts. Is that not the case here or am I misunderstanding? Great post though, thanks, -Chris

    ReplyDelete
    Replies
    1. Hi Chris,

      There are different facet methods in the facets module; some use the taxonomy index, but others, like dynamic range faceting used here, do not. For dynamic range faceting you just need a source of values (e.g. a previously indexed doc values field, or an expression).

      There's also SortedSetDocValues faceting, which rely on sorted-set doc values, not the taxonomy index.

      Delete
    2. Hey Mike, can this dynamic range faceting or SortedSetDocValue faceting be done in Solr plugins? How about aggregating faceted counts? How would that adjust the index building and search components to make use of these features?
      Thanks,
      -Chris

      Delete
    3. Unfortunately the Lucene facets module has not yet been integrated into Solr ... perhaps it could be done as a plugin, but I don't have much experience with Solr's plugins...

      Delete
  2. Thanks for posting Michael. I've been trying to decide between Lucene and mongoDB for geo proximity search for a while. Still on the fence as I haven't done any performance tests, but based on my experience with Lucene I think we may have a winner. ;)

    ReplyDelete
  3. Hi Michael,

    I want to use DrillDownQuery once the user picks a distance for drill-down. Where can I found some examples? Or can you give some? The examples that' i've found are not always adapted for the new lucene's version 4.7.2. FacetResultNode are not present anymore etc..
    Thks for your work and for your reply!
    Miriam.

    ReplyDelete
    Replies
    1. Have a look at the lucene/demo module on recent Lucene releases: it has DistanceFacetsExample.java

      Delete
    2. Thanks for your reply.
      I've downloaded DistanceFacetsExample.java on my eclipse environment and run it.
      The thing that I don't understand is how can I retrieve latitude/longitude fields after the query?
      Thank you for your help. Miriam

      Delete
    3. You could index them as stored fields? (Just pass Field.Store.YES when you create the DoubleField), and then retrieve each hit at search time.

      Delete
    4. So, I pass Field.Store.YES, after that, I worked with scoreDocs. I finally retrieve my stored fields with the doc. Thks for your help!
      Can you give me your opinion about that: I am developing an android app that range by distance articles geolocalized. The app displays the different change of distance: 1km 2km etc.. when users scroll the articles. So I have to range distances and display dynamically articles following the scroll user.
      Do you think that using lucene facets for my app can be a good solution?
      Thank you for all the help that you could bring me.
      Best regards.
      Miriam.

      Delete
    5. Well, Lucene isn't "officially" supported on Android, since Android runs only a subset of Java. That said, it may just work. Try it and see?

      Delete
    6. Yes I confirm your reply ^^ I spent many days on my android project, trying to run lucene facets and it takes too long and makes my eclipse crash ^^.
      Is it the same thing with elasticsearch? Do you know another tool or library that I can use for android? Many thks for your reply.
      Miriam

      Delete
    7. Lucene facets should not take so long and make your eclipse crash! Can you send some details about that to the users list? (java-user@lucene.apache.org).

      Delete
  4. does anyone have a working link to this source code. I need to see how the ranges are constructed.

    ReplyDelete
  5. I found it. Sorry I know this thread is old but it still really helpful. https://lucene.apache.org/core/4_7_0/demo/src-html/org/apache/lucene/demo/facet/DistanceFacetsExample.html

    ReplyDelete