Monday, May 14, 2012

Finite State Automata in Lucene

Lucene Revolution 2012 is now done, and the talk Robert and I gave went well! We showed how we are using automata (FSAs and FSTs) to make great improvements throughout Lucene.

You can view the slides here.

This was the first time I used Google Docs exclusively for a talk, and I was impressed! The real-time collaboration was awesome: we could each see the edits the other was making, live. You never have to "save" your document: instead, every time you make a change, the document is saved to a new revision and you can then use infinite undo, or step back through all revisions, to go back.

Finally, Google Docs covers the whole life-cycle of your talk: editing/iterating, presenting (it presents in full-screen just fine, but does require an internet connection; I exported to PDF ahead of time as a backup) and, finally, sharing with the rest of the world!

Monday, April 30, 2012

Lucene's TokenStreams are actually graphs!

Lucene's TokenStream class produces the sequence of tokens to be indexed for a document's fields. The API is an iterator: you call incrementToken to advance to the next token, and then query specific attributes to obtain the details for that token. For example, CharTermAttribute holds the text of the token; OffsetAttribute has the character start and end offset into the original string corresponding to this token, for highlighting purposes. There are a number of standard token attributes, and some tokenizers add their own attributes.
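
As a concrete example, here's a small sketch of consuming a TokenStream and printing each token's term and offsets; the analyzer, field name and text are just placeholders, and the calls follow the usual reset/incrementToken/end/close sequence:
  Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
  TokenStream ts = analyzer.tokenStream("body", new StringReader("fast wi fi network is down"));

  // Attribute instances are reused across calls to incrementToken:
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);

  ts.reset();
  while (ts.incrementToken()) {
    System.out.println(termAtt + " [" + offsetAtt.startOffset() + "-" + offsetAtt.endOffset() + "]");
  }
  ts.end();
  ts.close();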

The TokenStream is actually a chain, starting with a Tokenizer that splits characters into initial tokens, followed by any number of TokenFilters that modify the tokens. You can also use a CharFilter to pre-process the characters before tokenization, for example to strip out HTML markup, remap character sequences or replace characters according to a regular expression, while preserving the proper offsets back into the original input string. Analyzer is the factory class that creates TokenStreams when needed.
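
For example, a tiny chain might be built like this (just a sketch with stock components, using the 3.x-style constructors that take a Version):
  Reader reader = new StringReader("Fast Wi Fi Network is Down");

  // The Tokenizer splits the raw characters into initial tokens:
  Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_36, reader);

  // TokenFilters then modify the tokens, one stage at a time:
  TokenStream chain = new LowerCaseFilter(Version.LUCENE_36, tokenizer);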

Lucene and Solr have a wide variety of Tokenizers and TokenFilters, including support for at least 34 languages.

Let's tokenize a simple example: fast wi fi network is down. Assume we preserve stop words. When viewed as a graph, the tokens look like this:



Each node is a position, and each arc is a token. The TokenStream enumerates a directed acyclic graph, one arc at a time.

Next, let's add SynonymFilter into our analysis chain, applying these synonyms:
  • fast → speedy
  • wi fi → wifi
  • wi fi network → hotspot
resulting in this graph:



Now the graph is more interesting! For each token (arc), the PositionIncrementAttribute tells us how many positions (nodes) ahead this arc starts from, while the new (as of 3.6.0) PositionLengthAttribute tells us how many positions (nodes) ahead the arc arrives to.
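
Here's a rough sketch of building that SynonymFilter and walking the graph it produces by reading those two attributes; the SynonymMap.Builder calls reflect my understanding of the synonym module's API, so treat the details as approximate:
  // Build the synonym map (dedup=true); multi-word inputs are joined with
  // SynonymMap.Builder.join:
  SynonymMap.Builder builder = new SynonymMap.Builder(true);
  builder.add(new CharsRef("fast"), new CharsRef("speedy"), true);
  builder.add(SynonymMap.Builder.join(new String[] {"wi", "fi"}, new CharsRef()),
              new CharsRef("wifi"), true);
  builder.add(SynonymMap.Builder.join(new String[] {"wi", "fi", "network"}, new CharsRef()),
              new CharsRef("hotspot"), true);
  SynonymMap map = builder.build();

  TokenStream ts = new SynonymFilter(
      new WhitespaceTokenizer(Version.LUCENE_36, new StringReader("fast wi fi network is down")),
      map, true);

  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
  PositionLengthAttribute posLenAtt = ts.addAttribute(PositionLengthAttribute.class);

  int node = -1;
  ts.reset();
  while (ts.incrementToken()) {
    node += posIncAtt.getPositionIncrement();           // node this arc leaves from
    int toNode = node + posLenAtt.getPositionLength();  // node this arc arrives to
    System.out.println(termAtt + ": " + node + " -> " + toNode);
  }
  ts.end();
  ts.close();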

Besides SynonymFilter, several other analysis components now produce token graphs. Kuromoji's JapaneseTokenizer outputs the decompounded form for compound tokens. For example, tokens like ショッピングセンター (shopping center) will also have an alternate path with ショッピング (shopping) followed by センター (center). Both ShingleFilter and CommonGramsFilter set the position length to 2 when they merge two input tokens.

Other analysis components should produce a graph but don't yet (patches welcome!): WordDelimiterFilter, DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter, NGramTokenFilter, EdgeNGramTokenFilter, and likely others.

Limitations

There are unfortunately several hard-to-fix problems with token graphs. One problem is that the indexer completely ignores PositionLengthAttribute; it only pays attention to PositionIncrementAttribute. This means the indexer acts as if all arcs always arrive at the very next position, so for the above graph we actually index this:



This means certain phrase queries should match but don't (e.g.: "hotspot is down"), and other phrase queries shouldn't match but do (e.g.: "fast hotspot fi"). Other cases do work correctly (e.g.: "fast hotspot"). We refer to this "lossy serialization" as sausagization, because the incoming graph is unexpectedly turned from a correct word lattice into an incorrect sausage. This limitation is challenging to fix: it requires changing the index format (and Codec APIs) to store an additional int position length per position, and then fixing positional queries to respect this value.

QueryParser also ignores position length; however, this should be easier to fix. This would mean you can run graph analyzers at query time (i.e., query time expansion) and get the correct results.

Another problem is that SynonymFilter also unexpectedly performs its own form of sausagization when the injected synonym is more than one token. For example if you have this rule:
  • dns → domain name service
it results in graphs like this:



Notice how name was overlapped onto is, and service was overlapped onto up. It's an odd word salad!

This of course also messes up phrase queries ("domain name service is up" should match but doesn't, while "dns name up" shouldn't match but does). To work around this problem you should ensure all of your injected synonyms are single tokens! For this case, you could run the reverse mapping (domain name service → dns) at query time (as well as indexing time) and then both queries dns and domain name service will match any document containing either variant.

This happens because SynonymFilter never creates new positions; if it did so, it could make new positions for tokens in domain name service, and then change dns to position length 3.

Another problem is that SynonymFilter, like the indexer, also ignores the position length of the incoming tokens: it cannot properly consume a token graph. So if you added a second SynonymFilter it would fail to match hotspot is down.

We've only just started but bit by bit our token streams are producing graphs!

Saturday, April 28, 2012

Lucene has two Google Summer of Code students!

I'm happy to announce that two Lucene Google Summer of Code projects were accepted for this summer!

The first project (LUCENE-3312), proposed by Nikola Tanković, will separate StorableField out of IndexableField, and also fix the longstanding, confusing trap of assuming you can retrieve a Document at search time and re-index it without losing anything: unfortunately, that's not true!

The second project (LUCENE-3892), proposed by Han Jiang, will create a postings format using PForDelta compression (PDF). Long ago we saw great performance from PForDelta, but with a hacked up prototype patch that couldn't be committed. The goal of this project is to implement PForDelta "for real" and characterize the resulting performance.

Welcome aboard Nikola and Han!

Wednesday, March 14, 2012

New index statistics in Lucene 4.0

In the past, Lucene recorded only the bare minimal aggregate index statistics necessary to support its hard-wired classic vector space scoring model.

Fortunately, this situation is wildly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, Language models, Divergence from Randomness models and Information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.

To understand the new statistics, let's pretend we've indexed the following two example documents, each with only one field "title":
  • document 1: The Lion, the Witch, and the Wardrobe
  • document 2: The Da Vinci Code
Assume we tokenize on whitespace, commas are removed, all terms are downcased and we don't discard stop-words. Here are the statistics Lucene tracks:
    TermsEnum.docFreq()
How many documents contain at least one occurrence of the term in the field; 3.x indices also save this (TermEnum.docFreq()). For term "lion" docFreq is 1, and for term "the" it's 2.

    Terms.getSumDocFreq()
Number of postings, i.e. sum of TermsEnum.docFreq() across all terms in the field. For our example documents this is 9.

    TermsEnum.totalTermFreq()
Number of occurrences of this term in the field, across all documents. For term "the" it's 4, for term "vinci" it's 1.

    Terms.getSumTotalTermFreq()
Number of term occurrences in the field, across all documents; this is the sum of TermsEnum.totalTermFreq() across all unique terms in the field. For our example documents this is 11.

    Terms.getDocCount()
How many documents have at least one term for this field. In our example documents, this is 2, but if for example one of the documents was missing the title field, it would be 1.

    Terms.getUniqueTermCount()
How many unique terms were seen in this field. For our example documents this is 8. Note that this statistic is of limited utility for scoring, because it's only available per-segment and you cannot (efficiently!) compute this across all segments in the index (unless there is only one segment).

    Fields.getUniqueTermCount()
Number of unique terms across all fields; this is the sum of Terms.getUniqueTermCount() across all fields. In our example documents this is 8. Note that this is also only available per-segment.

    Fields.getUniqueFieldCount()
Number of unique fields. For our example documents this is 1; if we also had a body field and an abstract field, it would be 3. Note that this is also only available per-segment.

3.x indices only store TermsEnum.docFreq(), so if you want to experiment with the new scoring models in Lucene 4.0, you should either re-index or upgrade your index using IndexUpgrader. Note that the new scoring models all use the same single-byte norms format, so you can freely switch between them without re-indexing.

In addition to what's stored in the index, there are also these statistics available per-field, per-document while indexing, in the FieldInvertState passed to Similarity.computeNorm method for both 3.x and 4.0:
    length
How many tokens in the document. For document 1 it's 7; for document 2 it's 4.

    uniqueTermCount
For this field in this document, how many unique terms are there? For document 1, it's 5; for document 2 it's 4.

    maxTermFrequency
What was the count of the most frequent term in this document? For document 1 it's 3 ("the" occurs 3 times); for document 2 it's 1.

In 3.x, if you want to consume these indexing-time statistics, you'll have to save them away yourself (e.g., somehow encoding them into the single-byte norm value). However, since 4.0 uses doc values for norms, you have more freedom to encode these statistics however you'd like. Your custom similarity can then pull from these.
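
As an illustration only (not a recommended scoring formula), here's a sketch of a custom Similarity that folds some of these per-document statistics into the norm, assuming the 3.x computeNorm signature:
  public class StatsSimilarity extends DefaultSimilarity {
    @Override
    public float computeNorm(String field, FieldInvertState state) {
      // Scale the usual 1/sqrt(length) norm by how "bursty" the document is
      // (max term frequency relative to the number of unique terms):
      float burstiness = (float) state.getMaxTermFrequency() / state.getUniqueTermCount();
      return state.getBoost() * (float) (1.0 / Math.sqrt(state.getLength() * burstiness));
    }
  }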

From these available statistics you're now free to derive other commonly used statistics:
  • Average field length across all documents is Terms.getSumTotalTermFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).

  • Average within-document field term frequency is FieldInvertState.length divided by FieldInvertState.uniqueTermCount.

  • Average number of unique terms per field across all documents is Terms.getSumDocFreq() divided by maxDoc (or Terms.getDocCount(field), if not all documents have the field).
Remember that the statistics do not reflect deleted documents, until those documents are merged away; in general this also means that segment merging will alter scores! Similarly, if the field omits term frequencies, then the statistics will not be correct (though they will still be consistent with one another: we will pretend each term occurred once per document).
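
For example, the first derived statistic above (average field length) might be computed like this at search time; this is a sketch assuming the trunk-era MultiFields API for getting the field's Terms:
  Terms terms = MultiFields.getTerms(reader, "title");
  if (terms != null) {
    long sumTotalTermFreq = terms.getSumTotalTermFreq();
    int docCount = terms.getDocCount();  // documents that have this field
    double avgFieldLength = (double) sumTotalTermFreq / docCount;
    System.out.println("average title length: " + avgFieldLength);
  }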

Monday, March 5, 2012

Crowd-data can find drug and vaccine side effects

The social crowd has proven to be powerful, if you can find some way to harness it: crowd-sourcing can perform tasks and solve collaborative problems, crowd-funding can raise substantial financing.

I suspect crowd-data will similarly become an effective way to create large, realistic databases.

A great application of this is the medical world, where many people post to health forums raising medical problems, possible side effects from drugs and vaccines, etc. Why not collect all such posts to find previously undiscovered problems? In fact, this paper describes just that: the authors extracted the nasty side effects of statin drugs based on posts to online health forums. Similarly, this abstract describes a system that used crowd-data to spot nasty side effects from Singulair, years before the FDA issued a warning. The VAERS database, which gathers parent-reported problems after children receive vaccines, is another example.

Unfortunately the drug safety trials that take place before a drug can be released are not especially trustworthy. Here's a scary quote from that interview:

    When you look at the highest quality medical studies, the odds that a study will favor the use of a new drug are 5.3 times higher for commercially funded studies than for noncommercially funded studies.

And that was 7 years ago! I imagine the situation has only gotten worse.

When a new drug is released, the true, unbiased drug trial begins when millions of guinea-pigs start taking it. Crowd-data makes it possible to draw conclusions from that post-market drug trial.

Of course there are challenging tradeoffs: crowd-data, being derived from "ordinary people" without any rigorous standard collection process, can be dirty, incomplete and reflect sampling bias (only people experiencing nasty side effects speak up). For these reasons, old-fashioned journals turn their noses up at papers drawing conclusions from crowd-data.

Nevertheless, I believe such limitations are more than offset by the real-time nature and sheer scale of the millions of people constantly posting information over time. Inevitably, trustworthy patterns will emerge above the noise. Unlike the synthetic drug trial, this data is as real as you can get: sure, perhaps the drug seemed fine in the carefully controlled pre-market testing, but then out in the real world, unexpected interactions can suddenly emerge. Crowd-data will enable us to find such cases quickly and reliably, as long as we still have enough willing guinea-pigs!

Fast forward a few years and I expect crowd-data will be an excellent means of drawing conclusions, and will prove more reliable than the company-funded pre-market drug trials.

Thursday, March 1, 2012

Transactional Lucene

Many users don't appreciate the transactional semantics of Lucene's APIs and how this can be useful in search applications. For starters, Lucene implements ACID properties:
  • Atomicity: when you make changes (adding, removing documents) in an IndexWriter session, and then commit, either all (if the commit succeeds) or none (if the commit fails) of your changes will be visible, never something in-between. Some methods have their own atomic behavior: if you call updateDocument, which is implemented as a delete followed by an add, you'll never see the delete without the add, even if you open a near-real-time (NRT) reader or commit from a separate thread. Similarly if you add a block of documents, using the relatively new addDocuments method, you'll see either none or all of the documents in any reader you obtain.

  • Consistency: if the computer or OS crashes, or the JVM crashes or is killed, or power is lost, your index will remain intact (ie, not corrupt). Note that other problems, such as bad RAM, a bit-flipping CPU or file system corruption, can still easily corrupt the index!

  • Isolation: while IndexWriter is making changes, nothing is visible to any IndexReader searching the index, until you commit or open a new NRT reader. Only one IndexWriter instance at a time can change the index.

  • Durability: once commit returns, all changes have been written to durable storage (assuming your I/O system correctly implements fsync). If the computer or OS crashes, or the JVM crashes or is killed, or power is lost to the computer, all changes will still be present in the index.

Lucene provides a two-phased commit API: call the prepareCommit method to do all of the hard work (applying buffered deletes, writing buffered documents, fsyncing files). If something is going to go wrong (e.g., disk fills up) it'll almost certainly happen during this first phase. Then, call commit to complete the transaction.
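
In code, the two phases look something like this sketch; handling a first-phase failure with rollback is just one reasonable choice:
  try {
    // Phase 1: does nearly all of the work (applies deletes, flushes, fsyncs):
    writer.prepareCommit();
  } catch (IOException ioe) {
    // Something went wrong (e.g., disk full); discard the pending changes:
    writer.rollback();
    throw ioe;
  }
  // Phase 2: quick, and makes the changes durable and visible to new readers:
  writer.commit();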

When you close the IndexWriter, it calls commit under the hood. If, instead, you want to discard all changes since the last commit, call the rollback method instead, which also closes the writer. You can even rollback a CREATE: if you have an existing index, and you open an IndexWriter on it with OpenMode.CREATE, and then rollback, the index will be unchanged. Likewise, if you call deleteAll and then rollback.

Note that merely opening an IndexWriter on a new directory does not create an empty commit; ie, you cannot open an IndexReader on the directory until you've called commit yourself.

Lucene does not implement a transaction log itself, but it's easy to build that layer on top. For example, popular search servers such as Solr and ElasticSearch do so.

Multiple commits in one index

A single Lucene index is free to contain more than one commit; this is a powerful yet often overlooked feature. Each commit holds a point-in-time view of the index as it existed when the commit was created.

This is similar to the snapshots and writable clones available in modern filesystems like ZFS and the up-and-coming Btrfs. In fact, Lucene is able to efficiently expose multiple commits for the very same underlying reason: all index segments and files are write-once, just like the file blocks in ZFS and Btrfs.

To save multiple commits in your index, just implement your own IndexDeletionPolicy and pass it to IndexWriter. This is the class Lucene uses to know which commits should be deleted: IndexWriter invokes it on opening an index and whenever a commit succeeds. The default policy, KeepOnlyLastCommitDeletionPolicy, deletes all but the last commit. If you use NoDeletionPolicy then every commit is retained!

You can pass a userData (Map<String,String>) to commit, to record custom information (opaque to Lucene) about that commit, and then use IndexReader.listCommits to find all commits in the index. Once you've found a commit, you can open an IndexReader on it to search the index as of that commit.
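
Putting those pieces together, here's a sketch using 3.x-style APIs: keep every commit, tag a commit with a label via userData, then later find it and open a point-in-time reader on it:
  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_35, analyzer);
  iwc.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);
  IndexWriter writer = new IndexWriter(dir, iwc);

  // ... add/update/delete documents ...

  Map<String,String> userData = new HashMap<String,String>();
  userData.put("tag", "before-reindex");
  writer.commit(userData);

  // Later: find that commit and open a point-in-time reader on it:
  for (IndexCommit commit : IndexReader.listCommits(dir)) {
    if ("before-reindex".equals(commit.getUserData().get("tag"))) {
      IndexReader reader = IndexReader.open(commit, true);  // true = read-only
      // ... search as of that commit ...
      reader.close();
    }
  }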

You can also open an IndexWriter on a prior commit, to effectively roll back all changes after it: this is just like the rollback method, except it enables you to rollback across commits and not just the changes made in the current IndexWriter session.

Old commits are still kept even when you open an index with OpenMode.CREATE. It's also fine to pass OpenMode.CREATE when IndexReaders are still searching the old commits. This enables fun use cases, such as fully re-indexing your content between each commit without affecting any open readers.

Combining all of these fun transactional features, you can do some cool things:

  • Hot backups, using SnapshotDeletionPolicy or PersistentSnapshotDeletionPolicy: these deletion policies make it trivial to take a "live" backup of the index without blocking ongoing changes with IndexWriter. The backup can easily be incremental (just copy the new files, remove the deleted ones), and you can freely throttle the IO to minimize any interference with searching.

  • Searching different catalog versions: perhaps you run an e-commerce site and ship multiple versions of your catalog. In this case you can keep older commits around, each searching a specific version of your catalog, enabling users to choose which catalog to search.

  • Repeatable indexing tests from the same initial index: maybe you want to run a bunch of performance tests, perhaps trying different RAM buffer sizes or merge factors, starting from a large initial index. To do this, simply run each test, but in the end, instead of closing the IndexWriter, use the rollback method to quickly return the index to its initial state, ready for the next test.

  • Force all index segments to be merged down to a single segment, but also keep the prior multi-segment commit. Then you can do tests to compare multi-segment vs single-segment performance.

  • Indexing and searching over the NFS file system: because NFS does not protect still-open files from deletion, you must use an IndexDeletionPolicy to keep each commit around until all open readers have finished with the commit (ie, reopened to a newer commit). The simple approach is time-based, for example: don't delete the commit until it is 15 minutes old, and then always reopen your readers every 5 minutes. Without this you'll hit all sorts of scary exceptions when searching over NFS.

  • Distributed commit: if you have other resources that must commit atomically along with the changes to your Lucene index, you can use the two-phased commit API. This is simple, but vulnerable to failures during the 2nd phase; to also recover from such cases, for example if Lucene completed its 2nd phase commit but the database's 2nd phase hit some error or crash or power loss, you can easily rollback Lucene's commit by opening an IndexWriter on the prior commit.

  • Experimental index changes: maybe you want to try re-indexing some subset of your index in a new way, but you're not sure it'll work out. In this case, just keep the old commit around, and then rollback if it didn't work out, or delete the old commit if it did.

  • Time-based snapshots: maybe you'd like the freedom to roll back to your index as it existed 1 day ago, 1 week ago, 1 month ago, etc., so you preserve commits based on their age.
Remember that keeping more than one commit alive will necessarily consume additional disk space; however, the overhead is often small since the multiple commits will usually share common segments, especially the larger, older ones.

Saturday, January 14, 2012

ToChildBlockJoinQuery in Lucene

In my last post I described a known limitation of BlockJoinQuery: it joins in only one direction (from child to parent documents). This can be a problem because some applications need to join in reverse (from parent to child documents) instead.

This is now fixed! I just committed a new query, ToChildBlockJoinQuery, to perform the join in the opposite direction. I also renamed the previous query to ToParentBlockJoinQuery.

You use it just like BlockJoinQuery, except in reverse: it wraps any other Query matching parent documents and translates it into a Query matching child documents. The resulting Query can then be combined with other queries against fields in the child documents, and you can then sort by child fields as well.

Using songs and albums as an example: imagine you index each song (child) and album (parent) as separate documents in a single document block. With ToChildBlockJoinQuery, you can now run queries like:
  albumName:thunder AND songName:numb
or
  albumName:thunder, sort by songTitle
Any query with constraints against album and/or song fields will work, and the returned hits will be individual songs (not grouped).
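
In code, the first query might look roughly like this sketch; albumsFilter is a placeholder for the cached Filter identifying album (parent) documents, built the same way as the parent filter described in my BlockJoinQuery post:
  // Query against the parent (album) documents:
  Query albumQuery = new TermQuery(new Term("albumName", "thunder"));

  // Translate album matches down to their child (song) documents;
  // the last argument says not to propagate parent scores to the songs:
  ToChildBlockJoinQuery joinQuery = new ToChildBlockJoinQuery(
      albumQuery, albumsFilter, false);

  // Combine with constraints against song fields:
  BooleanQuery query = new BooleanQuery();
  query.add(joinQuery, Occur.MUST);
  query.add(new TermQuery(new Term("songName", "numb")), Occur.MUST);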

ToChildBlockJoinQuery will be available in Lucene 3.6.0 and 4.0.

Sunday, January 8, 2012

Searching relational content with Lucene's BlockJoinQuery

Lucene's 3.4.0 release adds a new feature called index-time join (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of relational content.

Most search engines can't directly index relational content, as documents in the index logically behave like a single flat database table. Yet, relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have a separate list of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files.

Perhaps the PDF documents you need to search are immense, so you break them up and index each section as a separate Lucene document; in this case you'll have common fields (title, abstract, author, date published, etc.) for the overall document, joined to the sub-document (section) with its own fields (text, page number, etc.). XML documents typically contain nested tags, representing joined sub-documents; emails have attachments; office documents can embed other documents. Nearly all search domains have some form of relational content, often requiring more than one join.

If such content is so common then how do search applications handle it today?

One obvious "solution" is to simply use a relational database instead of a search engine! If relevance scores are less important and you need to do substantial joining, grouping, sorting, etc., then using a database could be best overall. Most databases include some form a text search, some even using Lucene.

If you still want to use a search engine, then one common approach is to denormalize the content up front, at index-time, by joining all tables and indexing the resulting rows, duplicating content in the process. For example, you'd index each song as a Lucene document, copying over all fields from the song's joined album and artist/band. This works correctly, but can be horribly wasteful as you are indexing identical fields, possibly including large text fields, over and over.

Another approach is to do the join yourself, outside of Lucene, by indexing songs, albums and artist/band as separate Lucene documents, perhaps even in separate indices. At search-time, you first run a query against one collection, for example the songs. Then you iterate through all hits, gathering up (joining) the full set of corresponding albums and then run a second query against the albums, with a large OR'd list of the albums from the first query, repeating this process if you need to join to artist/band as well. This approach will also work, but doesn't scale well, as the follow-on queries can become immense.

Yet another approach is to use a software package that has already implemented one of these approaches for you! elasticsearch, Apache Solr, Apache Jackrabbit, Hibernate Search and many others all handle relational content in some way.

With BlockJoinQuery you can now directly search relational content yourself!

Let's work through a simple example: imagine you sell shirts online. Each shirt has certain common fields such as name, description, fabric, price, etc. For each shirt you have a number of separate stock keeping units or SKUs, which have their own fields like size, color, inventory count, etc. The SKUs are what you actually sell, and what you must stock, because when someone buys a shirt they buy a specific SKU (size and color).

Maybe you are lucky enough to sell the incredible Mountain Three-wolf Moon Short Sleeve Tee, with these SKUs (size, color):
  • small, blue
  • small, black
  • medium, black
  • large, gray
Perhaps a user first searches for "wolf shirt", gets a bunch of hits, and then drills down on a particular size and color, resulting in this query:
   name:wolf AND size=small AND color=blue
which should match this shirt. name is a shirt field while the size and color are SKU fields.

But if the user drills down instead on a small gray shirt:
   name:wolf AND size=small AND color=gray
then this shirt should not match because the small size only comes in blue and black.

How can you run these queries using BlockJoinQuery? Start by indexing each shirt (parent) and all of its SKUs (children) as separate documents, using the new IndexWriter.addDocuments API to add one shirt and all of its SKUs as a single document block. This method atomically adds a block of documents into a single segment as adjacent document IDs, which BlockJoinQuery relies on. You should also add a marker field to each shirt document (e.g. type = shirt), as BlockJoinQuery requires a Filter identifying the parent documents.
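
Here's a sketch of indexing one such block with 3.x-style Field constructors; note that the parent (shirt) document goes last in the block:
  List<Document> block = new ArrayList<Document>();

  // One child (SKU) document per size/color combination:
  for (String[] sku : new String[][] {{"small", "blue"}, {"small", "black"},
                                      {"medium", "black"}, {"large", "gray"}}) {
    Document skuDoc = new Document();
    skuDoc.add(new Field("size", sku[0], Field.Store.YES, Field.Index.NOT_ANALYZED));
    skuDoc.add(new Field("color", sku[1], Field.Store.YES, Field.Index.NOT_ANALYZED));
    block.add(skuDoc);
  }

  // The parent (shirt) document, with its marker field, goes last in the block:
  Document shirt = new Document();
  shirt.add(new Field("name", "Mountain Three-wolf Moon Short Sleeve Tee",
                      Field.Store.YES, Field.Index.ANALYZED));
  shirt.add(new Field("type", "shirt", Field.Store.NO, Field.Index.NOT_ANALYZED));
  block.add(shirt);

  writer.addDocuments(block);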

To run a BlockJoinQuery at search-time, you'll first need to create the parent filter, matching only shirts. Note that the filter must use FixedBitSet under the hood, like CachingWrapperFilter:
  Filter shirts = new CachingWrapperFilter(
                    new QueryWrapperFilter(
                      new TermQuery(
                        new Term("type", "shirt"))));
Create this filter once, up front and re-use it any time you need to perform this join.

Then, for each query that requires a join, because it involves both SKU and shirt fields, start with the child query matching only SKU fields:
  BooleanQuery skuQuery = new BooleanQuery();
  skuQuery.add(new TermQuery(new Term("size", "small")), Occur.MUST);
  skuQuery.add(new TermQuery(new Term("color", "blue")), Occur.MUST);
Next, use BlockJoinQuery to translate hits from the SKU document space up to the shirt document space:
  BlockJoinQuery skuJoinQuery = new BlockJoinQuery(
    skuQuery, 
    shirts,
    ScoreMode.None);
The ScoreMode enum decides how scores for multiple SKU hits should be aggregated to the score for the corresponding shirt hit. In this query you don't need scores from the SKU matches, but if you did you can aggregate with Avg, Max or Total instead.

Finally you are now free to build up an arbitrary shirt query using skuJoinQuery as a clause:
  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("name", "wolf")), Occur.MUST);
  query.add(skuJoinQuery, Occur.MUST);
You could also just run skuJoinQuery as-is if the query doesn't have any shirt fields.

Finally, just run this query like normal! The returned hits will be only shirt documents; if you'd also like to see which SKUs matched for each shirt, use BlockJoinCollector:
  BlockJoinCollector c = new BlockJoinCollector(
    Sort.RELEVANCE, // sort
    10,             // numHits
    true,           // trackScores
    false           // trackMaxScore
    );
  searcher.search(query, c);
The provided Sort must use only shirt fields (you cannot sort by any SKU fields). When each hit (a shirt) is competitive, this collector will also record all SKUs that matched for that shirt, which you can retrieve like this:
  TopGroups hits = c.getTopGroups(
    skuJoinQuery,
    skuSort,
    0,   // offset
    10,  // maxDocsPerGroup
    0,   // withinGroupOffset
    true // fillSortFields
  );
Set skuSort to the sort order for the SKUs within each shirt. The first offset hits are skipped (use this for paging through shirt hits). Under each shirt, at most maxDocsPerGroup SKUs will be returned. Use withinGroupOffset if you want to page within the SKUs. If fillSortFields is true then each SKU hit will have values for the fields from skuSort.

The hits returned by BlockJoinCollector.getTopGroups are SKU hits, grouped by shirt. You'd get the exact same results if you had denormalized up-front and then used grouping to group results by shirt.

You can also do more than one join in a single query; the joins can be nested (parent to child to grandchild) or parallel (parent to child1 and parent to child2).

However, there are some important limitations of index-time joins:
  • The join must be computed at index-time and "compiled" into the index, in that all joined child documents must be indexed along with the parent document, as a single document block.

  • Different document types (for example, shirts and SKUs) must share a single index, which is wasteful as it means non-sparse data structures like FieldCache entries consume more memory than they would if you had separate indices.

  • If you need to re-index a parent document or any of its child documents, or delete or add a child, then the entire block must be re-indexed. This is a big problem in some cases, for example if you index "user reviews" as child documents then whenever a user adds a review you'll have to re-index that shirt as well as all its SKUs and user reviews.

  • There is no QueryParser support, so you need to programmatically create the parent and child queries, separating according to parent and child fields.

  • The join can currently only go in one direction (mapping child docIDs to parent docIDs), but in some cases you need to map parent docIDs to child docIDs. For example, when searching songs, perhaps you want all matching songs sorted by their title. You can't easily do this today because the only way to get song hits is to group by album or band/artist.

  • The join is a one (parent) to many (children), inner join.
As usual, patches are welcome!

There is work underway to create a more flexible, but likely less performant, query-time join capability, which should address a number of the above limitations.

Thursday, November 10, 2011

SearcherLifetimeManager prevents a broken search user experience



In the past, search indices were usually very static: you built them once, called optimize at the end and shipped them off, and didn't change them very often.

But these days it's just the opposite: most applications have very dynamic indices, constantly being updated with a stream of changes, and you never call optimize anymore.

Lucene's near-real-time search, especially with recent improvements including manager classes to handle the tricky complexities of sharing searchers across threads, offers very fast search turnaround on index changes.

But there is a serious yet often overlooked problem with this approach. To see it, you have to put yourself in the shoes of a user. Imagine Alice comes to your site, runs a search, and is looking through the search results. Not satisfied, after a few seconds she decides to refine that first search. Perhaps she drills down on one of the nice facets you presented, or maybe she clicks to the next page, or picks a different sort criteria (any follow-on action will do). So a new search request is sent back to your server, including the first search plus the requested change (drill down, next page, change sort field, etc.).

How do you handle this follow-on search request? Just pull the latest and greatest searcher from your SearcherManager or NRTManager and search away, right?

Wrong!

If you do this, you risk a broken search experience for Alice, because the new searcher may be different from the original searcher used for Alice's first search request. The differences could be substantial, if you had just opened a new searcher after updating a bunch of documents. This means the results of Alice's follow-on search may have shifted: facet counts are now off, hits are sorted differently so some hits may be duplicated on the second page, or may be lost (if they moved from page 2 to page 1), etc. If you use the new searchAfter API (coming in Lucene 3.5.0) for efficient paging, the risk is even greater!

Perversely, the frequent searcher reopening that you thought provides such a great user experience by making all search results so fresh, can in fact have just the opposite effect. Each reopen risks breaking all current searches in your application; the more active your site, the more searches you might break!

It's deadly to intentionally break a user's search experience: they will (correctly) conclude your search is buggy, eroding their trust, and then take their business to your competition.

It turns out, this is easy to fix! Instead of pulling the latest searcher for every incoming search request, you should try to pull the same searcher used for the initial search request in the session. This way all follow-on searches see exactly the same index.

Fortunately, there's a new class coming in Lucene 3.5.0 that simplifies this: SearcherLifetimeManager. The class is agnostic to how you obtain the fresh searchers (i.e., SearcherManager, NRTManager, or your own custom source) used for an initial search. Just like Lucene's other manager classes, SearcherLifetimeManager is very easy to use. Create the manager once, up front:
  SearcherLifetimeManager mgr = new SearcherLifetimeManager();
Then, when a search request arrives, if it's an initial (not follow-on) search, obtain the most current searcher in the usual way, but then record this searcher:
  long token = mgr.record(searcher);
The returned token uniquely identifies the specific searcher; you must save it somewhere alongside the user's search results, for example by placing it in a hidden HTML form field.

Later, when the user performs a follow-on search request, make sure the original token is sent back to the server, and then use it to obtain the same searcher:
  // If possible, obtain same searcher version as last
  // search:
  IndexSearcher searcher = mgr.acquire(token);
  if (searcher != null) {
    // Searcher is still here
    try {
      // do searching...
    } finally {
      mgr.release(searcher);
      // Do not use searcher after this!
      searcher = null;
    }
  } else {
    // Searcher was pruned -- notify user session timed
    // out
  }
As long as the original searcher is still available, the manager will return it to you; be sure to release that searcher (ideally in a finally clause).

It's possible the searcher is no longer available: for example, if Alice ran a new search, but then got hungry, went off to a long lunch, and only later returned and clicked "next page", the original searcher will likely have been pruned!

You should gracefully handle this case, for example by notifying Alice that the search had timed out and asking her to re-submit the original search (which will then get the latest and greatest searcher). Fortunately, you can reduce how often this happens, by controlling how aggressively you prune old searchers:
  mgr.prune(new PruneByAge(600.0));
This removes any searchers older than 10 minutes (you can also implement a custom pruning strategy). You should call it from a separate dedicated thread (not a searcher thread), ideally the same thread that's periodically indexing changes and opening new searchers.

Keeping many searchers around will necessarily tie up resources (open file descriptors, RAM, index files on disk that the IndexWriter would otherwise have deleted). However, because the reopened searchers share sub-readers, the resource consumption will generally be well contained, in proportion to how many index changes occurred between each reopen. Just be sure to use NRTCachingDirectory, to ensure you don't bump up against open file descriptor limits on your operating system (this also gives a good speedup in reopen turnaround time).

Don't erode your users' trust by intentionally breaking their searches!

LUCENE-3486 has the details.

Thursday, November 3, 2011

Near-real-time readers with Lucene's SearcherManager and NRTManager

Last time, I described the useful SearcherManager class, coming in the next (3.5.0) Lucene release, to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.

But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.

If you have access to the IndexWriter that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions. This controlled consistency model that Lucene exposes is a nice "best of both worlds" blend between the traditional immediate and eventual consistency models.

Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides extremely fast turnaround time on making index changes visible to searchers. Frequent reopens, such as every 50 milliseconds, are easily achievable on modern hardware, even under relatively high indexing rates.

Fortunately, it's trivial to use SearcherManager with NRT readers: use the constructor that takes IndexWriter instead of Directory:
  boolean applyAllDeletes = true;
  ExecutorService es = null;
  SearcherManager mgr = new SearcherManager(writer, applyAllDeletes,
                                            new MySearchWarmer(), es);
This tells SearcherManager that its source for new IndexReaders is the provided IndexWriter instance (instead of a Directory instance). After that, use the SearcherManager just as before.
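
Search-time usage is unchanged; a minimal sketch:
  IndexSearcher searcher = mgr.acquire();
  try {
    // search against this point-in-time view:
    TopDocs hits = searcher.search(query, 10);
  } finally {
    mgr.release(searcher);
    searcher = null;  // never use the searcher after releasing it
  }
  // From a separate thread, periodically ask the manager to reopen
  // (maybeReopen in the 3.5 API) so new changes become visible.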

Typically you'll set the applyAllDeletes boolean to true, meaning each reopened reader is required to apply all previous deletion operations (deleteDocuments or updateDocument/s) up until that point.

Sometimes your usage won't require deletions to be applied. For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions. If that's the case, you can pass applyAllDeletes=false instead. This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly. However, if you're using Lucene's trunk (to be eventually released as 4.0), another option is to use MemoryCodec on your id field to greatly reduce the primary-key lookup time.

Note that some or even all of the previous deletes may still be applied even if you pass false. Also, the pending deletes are never lost if you pass false: they remain buffered and will still eventually be applied.

If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two SearcherManagers, one applying deletes and one not.

If you pass a non-null ExecutorService, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request. Most applications do not require this, because the concurrency across multiple searches is sufficient. It's also not clear that this is effective in general as it adds per-segment overhead, and the available concurrency is a function of your index structure. Perversely, a fully optimized index will have no concurrency! Most applications should pass null.



NRTManager

What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches? Use NRTManager!

NRTManager holds onto the IndexWriter instance you provide and then exposes the same APIs for making index changes (addDocument/s, updateDocument/s, deleteDocuments). These methods forward to the underlying IndexWriter, but then return a generation token (a Java long) which you can hold onto after making any given change. The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.

Then, when a given search request requires certain changes to be visible, pass that generation back to NRTManager to obtain a searcher that's guaranteed to reflect all changes for that generation.

Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum. Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using NRTManager.addDocument, saving the returned generation. If she adds multiple posts, just keep the last generation.

Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made. Of course, if your reopen time is fast enough (say once per second), unless Alice types very quickly, any search she runs will already reflect her posts.

But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call NRTManager.waitForGeneration to obtain the SearcherManager to use for searching. If the latest searcher already covers the requested generation, the method returns immediately. Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.
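
In code this looks something like the following sketch; I'm only using the methods described above (addDocument returning a generation, and waitForGeneration handing back the SearcherManager), so treat the exact signatures as approximate:
  // Alice adds a post; keep the generation returned by her last change
  // (makeDocument is a hypothetical helper that builds the Lucene Document):
  long aliceGen = nrtManager.addDocument(makeDocument(alicePost));

  // Later, before running Alice's search, obtain a searcher guaranteed to
  // reflect that generation; this may block and trigger a reopen:
  SearcherManager mgr = nrtManager.waitForGeneration(aliceGen);
  IndexSearcher searcher = mgr.acquire();
  try {
    // run Alice's search...
  } finally {
    mgr.release(searcher);
  }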

If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it's far less important when Alice's changes become visible to Bob. There's (usually!) no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.

Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index. That immediate search must first wait for the returned generation to become available.

The power of NRTManager is you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model. NRTManager hides all the tricky details of tracking generations.

But: don't abuse this! You may be tempted to always wait for the last generation you indexed for all searches, but this would result in very low search throughput on concurrent hardware since all searches would bunch up, waiting for reopens. With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.

Managing reopens is a little trickier with NRTManager, since you should reopen at higher frequency whenever a search is waiting for a specific generation. To address this, there's the useful NRTManagerReopenThread class; use it like this:
  double minStaleSec = 0.025;
  double maxStaleSec = 5.0;
  NRTManagerReopenThread thread = new NRTManagerReopenThread(
                                        nrtManager,
                                        maxStaleSec,
                                        minStaleSec);
  thread.start();
  ...
  thread.close();
The minStaleSec sets an upper bound on the time a user must wait before the search can run. This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec.

The maxStaleSec sets a lower bound on how frequently reopens should occur. This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches. Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time. Be sure to eventually call thread.close(), when you are done reopening (for example, on shutting down the application).

You are also free to use your own strategy for calling maybeReopen; you don't have to use NRTManagerReopenThread. Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!

Tuesday, October 25, 2011

Accuracy and performance of Google's Compact Language Detector

To get a sense of the accuracy and performance of Google's Compact Language Detector, I ran some tests against two other packages:
  • Apache Tika's language identification
  • The language-detection library

For the test corpus I used the corpus described here, created by the author of language-detection. It contains 1000 texts from each of 21 languages, randomly sampled from the Europarl corpus.

It's not a perfect test (no test ever is!): the content is already very clean plain text; there are no domain, language, or encoding hints to apply (which you'd normally have with HTML content loaded over HTTP); and it "only" covers 21 languages (versus at least 76 that CLD can detect).

CLD and language-detection cover all 21 languages, but Tika is missing Bulgarian (bg), Czech (cs), Lithuanian (lt) and Latvian (lv), so I only tested on the remaining subset of 17 languages that all three detectors support. This works out to 17,000 texts totalling 2.8 MB.

Many of the texts are very short, making the test challenging: the shortest is 25 bytes, and 290 (1.7%) of the 17000 are 30 bytes or less.

In addition to the challenges of the corpora, the differences in the detectors make the comparison somewhat apples to oranges. For example, CLD detects at least 76 languages, while language-detection detects 53 and Tika detects 27, so this biases against CLD, and language-detection to a lesser extent, since their classification task is harder relative to Tika's.

For CLD, I disabled its option to abstain (removeWeakMatches), so that it always guesses at the language even when confidence is low, to match the other two detectors. I also turned off the pickSummaryLanguage option, as it was also hurting accuracy; now CLD simply picks the highest scoring match as the detected language.

For language-detection, I ran with the default ALPHA of 0.5, and set the random seed to 0.

Here are the raw results:

CLD results (total 98.82% = 16800 / 17000):
     da  93.4%   da=934  nb=54  sv=5  fr=2  eu=2  is=1  hr=1  en=1  
     de  99.6%   de=996  en=2  ga=1  cy=1          
     el  100.0%   el=1000                
     en  100.0%   en=1000                
     es  98.3%   es=983  pt=4  gl=3  en=3  it=2  eu=2  id=1  fi=1  da=1
     et  99.6%   et=996  ro=1  id=1  fi=1  en=1        
     fi  100.0%   fi=1000                
     fr  99.2%   fr=992  en=4  sq=2  de=1  ca=1        
     hu  99.9%   hu=999  it=1              
     it  99.5%   it=995  ro=1  mt=1  id=1  fr=1  eu=1      
     nl  99.5%   nl=995  af=3  sv=1  et=1          
     pl  99.6%   pl=996  tr=1  sw=1  nb=1  en=1        
     pt  98.7%   pt=987  gl=4  es=3  mt=1  it=1  is=1  ht=1  fi=1  en=1
     ro  99.8%   ro=998  da=1  ca=1            
     sk  98.8%   sk=988  cs=9  en=2  de=1          
     sl  95.1%   sl=951  hr=32  sr=8  sk=5  en=2  id=1  cs=1    
     sv  99.0%   sv=990  nb=9  en=1            


Tika results (total 97.12% = 16510 / 17000):
     da  87.6%   da=876  no=112  nl=4  sv=3  it=1  fr=1  et=1  en=1  de=1        
     de  98.5%   de=985  nl=3  it=3  da=3  sv=2  fr=2  sl=1  ca=1          
     el  100.0%   el=1000                        
     en  96.9%   en=969  no=10  it=6  ro=4  sk=3  fr=3  hu=2  et=2  sv=1        
     es  89.8%   es=898  gl=47  pt=22  ca=15  it=6  eo=4  fr=3  fi=2  sk=1  nl=1  et=1    
     et  99.1%   et=991  fi=4  fr=2  sl=1  no=1  ca=1              
     fi  99.4%   fi=994  et=5  hu=1                    
     fr  98.0%   fr=980  sl=6  eo=3  et=2  sk=1  ro=1  no=1  it=1  gl=1  fi=1  es=1  de=1  ca=1
     hu  99.9%   hu=999  ca=1                      
     it  99.4%   it=994  eo=4  pt=1  fr=1                  
     nl  97.8%   nl=978  no=8  de=3  da=3  sl=2  ro=2  pl=1  it=1  gl=1  et=1      
     pl  99.1%   pl=991  sl=3  sk=2  ro=1  it=1  hu=1  fi=1            
     pt  94.4%   pt=944  gl=48  hu=2  ca=2  it=1  et=1  es=1  en=1          
     ro  99.3%   ro=993  is=2  sl=1  pl=1  it=1  hu=1  fr=1            
     sk  96.2%   sk=962  sl=21  pl=13  it=2  ro=1  et=1              
     sl  98.5%   sl=985  sk=7  et=4  it=2  pt=1  no=1              
     sv  97.1%   sv=971  no=15  nl=6  da=6  de=1  ca=1              


Language-detection results (total 99.22% = 16868 / 17000):
     da  97.1%   da=971  no=28  en=1      
     de  99.8%   de=998  da=1  af=1      
     el  100.0%   el=1000          
     en  99.7%   en=997  nl=1  fr=1  af=1    
     es  99.5%   es=995  pt=4  en=1      
     et  99.6%   et=996  fi=2  de=1  af=1    
     fi  99.8%   fi=998  et=2        
     fr  99.8%   fr=998  sv=1  it=1      
     hu  99.9%   hu=999  id=1        
     it  99.8%   it=998  es=2        
     nl  97.7%   nl=977  af=21  sv=1  de=1    
     pl  99.9%   pl=999  nl=1        
     pt  99.4%   pt=994  es=3  it=1  hu=1  en=1  
     ro  99.9%   ro=999  fr=1        
     sk  98.7%   sk=987  cs=8  sl=2  ro=1  lt=1  et=1
     sl  97.2%   sl=972  hr=27  en=1      
     sv  99.0%   sv=990  no=8  da=2      


Some quick analysis:
  • The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the tests are!

  • The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and language-detection). Tika in particular has trouble with Spanish (confuses it with Galician). These confusions are to be expected: the languages are very similar.

When language-detection was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time. These numbers are quite low! It tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it's not the case that they are all always wrong on the short texts.

This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts). This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.

You could also make a simple majority-rules voting system across these (and other) libraries. I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with language-detection's choice. This gives the best accuracy of all: total 99.59% (= 16930 / 17000)!

Finally, I also separately tested the run time for each package. Each time is the best of 10 runs through the full corpus:

    CLD                    171 msec   16.331 MB/sec
    language-detection    2367 msec    1.180 MB/sec
    Tika                 42219 msec    0.066 MB/sec


CLD is incredibly fast! language-detection is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).

I used the 09-13-2011 release of language-detection, the current trunk (svn revision 1187915) of Apache Tika, and the current trunk (hg revision b0adee43f3b1) of CLD. All sources for the performance tests are available from here.

Monday, October 24, 2011

Additions to Compact Language Detector API


I've made some small improvements after my quick initial port of Google's Compact Language Detection Library, starting with some helpful Python constants:

  • cld.ENCODINGS has all the encoding names recognized by CLD; if you pass the encoding hint it must be one of these.

  • cld.LANGUAGES has the list of all base languages known (but not necessarily detectable) by CLD.

  • cld.EXTERNAL_LANGUAGES has the list of external languages known (but not necessarily detectable) by CLD.

  • cld.DETECTED_LANGUAGES has the list of detectable languages.

I haven't found a reliable way to get the full list of detectable languages;  for now, I've started with all languages that are covered by the unit test, total count 75, which should be a lower bound on the true count.

I also exposed control over whether CLD should abstain from a given matched language if the confidence is too low, by adding a parameter removeWeakMatches (required in C and optional in Python, default False).  Turn this option on if abstaining is OK in your use case, such as a browser toolbar offering to translate content.  Turn it off when testing accuracy vs other language detection libraries (unless they also abstain!).

In addition, CLD has an algorithm that tries to pick the best "summary" language, and it doesn't always just pick the highest scoring match. For example, the code has this comment:
    // If English and X, where X (not UNK) is big enough,
    // assume the English is boilerplate and return X.
See the CalcSummaryLanguage function for more details!

I found this was hurting accuracy in testing, so I added a parameter pickSummaryLanguage (default False) to turn this behavior on or off.

Finally, I fixed the Python binding to release the GIL while CLD is running, so multiple threads can now detect without falsely blocking one another.