Monday, May 14, 2012
Finite State Automata in Lucene
You can view the slides here.
This was the first time I used Google Docs exclusively for a talk, and I was impressed! The real-time collaboration was awesome: we could each see the edits the other was making, live. You never have to "save" your document: instead, every time you make a change, the document is saved to a new revision, and you can then use infinite undo, or step back through all revisions, to go back.
Finally, Google Docs covers the whole life-cycle of your talk: editing/iterating, presenting (it presents in full-screen just fine, but does require an internet connection; I exported to PDF ahead of time as a backup) and, finally, sharing with the rest of the world!
Monday, April 30, 2012
Lucene's TokenStreams are actually graphs!
The TokenStream is actually a chain, starting with a Tokenizer that splits characters into initial tokens, followed by any number of TokenFilters that modify the tokens. You can also use a CharFilter to pre-process the characters before tokenization, for example to strip out HTML markup, remap character sequences or replace characters according to a regular expression, while preserving the proper offsets back into the original input string. Analyzer is the factory class that creates TokenStreams when needed.
Lucene and Solr have a wide variety of Tokenizers and TokenFilters, including support for at least 34 languages.
Let's tokenize a simple example: fast wi fi network is down. Assume we preserve stop words. When viewed as a graph, the tokens look like this:

Each node is a position, and each arc is a token. The TokenStream enumerates a directed acyclic graph, one arc at a time.
Next, let's add SynonymFilter into our analysis chain, applying these synonyms:
- fast → speedy
- wi fi → wifi
- wi fi network → hotspot

Now the graph is more interesting! For each token (arc), the PositionIncrementAttribute tells us how many positions (nodes) ahead this arc starts from, while the new (as of 3.6.0) PositionLengthAttribute tells us how many positions (nodes) ahead the arc arrives at.
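To see those two attributes in action, here is a minimal sketch that walks any TokenStream and prints each arc; the attribute classes are the standard ones in org.apache.lucene.analysis.tokenattributes:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

// Sketch: enumerate the graph one arc (token) at a time and print its span.
void dumpGraph(TokenStream ts) throws java.io.IOException {
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
  PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    // posInc: how many positions ahead this arc starts (0 = stacked on the previous token)
    // posLen: how many positions ahead the arc arrives
    System.out.println(term.toString()
        + " posInc=" + posInc.getPositionIncrement()
        + " posLen=" + posLen.getPositionLength());
  }
  ts.end();
  ts.close();
}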
Besides SynonymFilter, several other analysis components now produce token graphs. Kuromoji's JapaneseTokenizer outputs the decompounded form for compound tokens. For example, tokens like ショッピングセンター (shopping center) will also have an alternate path with ショッピング (shopping) followed by センター (center). Both ShingleFilter and CommonGramsFilter set the position length to 2 when they merge two input tokens.
Other analysis components should produce a graph but don't yet (patches welcome!): WordDelimiterFilter, DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter, NGramTokenFilter, EdgeNGramTokenFilter, and likely others.
Limitations
There are unfortunately several hard-to-fix problems with token graphs. One problem is that the indexer completely ignores PositionLengthAttribute; it only pays attention to PositionIncrementAttribute. This means the indexer acts as if all arcs always arrive at the very next position, so for the above graph we actually index this:

This means certain phrase queries should match but don't (e.g.: "hotspot is down"), and other phrase queries shouldn't match but do (e.g.: "fast hotspot fi"). Other cases do work correctly (e.g.: "fast hotspot"). We refer to this "lossy serialization" as sausagization, because the incoming graph is unexpectedly turned from a correct word lattice into an incorrect sausage. This limitation is challenging to fix: it requires changing the index format (and Codec APIs) to store an additional int position length per position, and then fixing positional queries to respect this value.
QueryParser also ignores position length; however, this should be easier to fix. This would mean you can run graph analyzers at query time (i.e., query time expansion) and get the correct results.
Another problem is that SynonymFilter also unexpectedly performs its own form of sausagization when the injected synonym is more than one token. For example if you have this rule:
- dns → domain name service

Notice how name was overlapped onto is, and service was overlapped onto up. It's an odd word salad!
This of course also messes up phrase queries ("domain name service is up" should match but doesn't, while "dns name up" shouldn't match but does). To work around this problem you should ensure all of your injected synonyms are single tokens! For this case, you could run the reverse mapping (domain name service → dns) at query time (as well as indexing time) and then both queries dns and domain name service will match any document containing either variant.
This happens because SynonymFilter never creates new positions; if it did so, it could make new positions for the tokens in domain name service, and then change dns to position length 3.
Another problem is that SynonymFilter, like the indexer, also ignores the position length of the incoming tokens: it cannot properly consume a token graph. So if you added a second SynonymFilter it would fail to match hotspot is down.
We've only just started but bit by bit our token streams are producing graphs!
Saturday, April 28, 2012
Lucene has two Google Summer of Code students!
The first project (LUCENE-3312), proposed by Nikola Tanković, will separate StorableField out of IndexableField, and also fix the longstanding confusing trap that one can retrieve a Document at search time and re-index it, without losing anything. It's unfortunately not true!
The second project (LUCENE-3892), proposed by Han Jiang, will create a postings format using PForDelta compression (PDF). Long ago we saw great performance from PForDelta, but with a hacked up prototype patch that couldn't be committed. The goal of this project is to implement PForDelta "for real" and characterize the resulting performance.
Welcome aboard Nikola and Han!
Wednesday, March 14, 2012
New index statistics in Lucene 4.0
Fortunately, this situation is wildly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, Language models, Divergence from Randomness models and Information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.
To understand the new statistics, let's pretend we've indexed the following two example documents, each with only one field "title":
- document 1: The Lion, the Witch, and the Wardrobe
- document 2: The Da Vinci Code
- TermsEnum.docFreq() - How many documents contain at least one occurrence of the term in the field; 3.x indices also save this (TermEnum.docFreq()). For term "lion" docFreq is 1, and for term "the" it's 2.
- Terms.getSumDocFreq() - Number of postings, i.e. the sum of TermsEnum.docFreq() across all terms in the field. For our example documents this is 9.
- TermsEnum.totalTermFreq() - Number of occurrences of this term in the field, across all documents. For term "the" it's 4, for term "vinci" it's 1.
- Terms.getSumTotalTermFreq() - Number of term occurrences in the field, across all documents; this is the sum of TermsEnum.totalTermFreq() across all unique terms in the field. For our example documents this is 11.
- Terms.getDocCount() - How many documents have at least one term for this field. In our example documents, this is 2, but if for example one of the documents was missing the title field, it would be 1.
- Terms.getUniqueTermCount() - How many unique terms were seen in this field. For our example documents this is 8. Note that this statistic is of limited utility for scoring, because it's only available per-segment and you cannot (efficiently!) compute this across all segments in the index (unless there is only one segment).
- Fields.getUniqueTermCount() - Number of unique terms across all fields; this is the sum of Terms.getUniqueTermCount() across all fields. In our example documents this is 8. Note that this is also only available per-segment.
- Fields.getUniqueFieldCount() - Number of unique fields. For our example documents this is 1; if we also had a body field and an abstract field, it would be 3. Note that this is also only available per-segment.
Of these statistics, 3.x indices only save TermsEnum.docFreq(), so if you want to experiment with the new scoring models in Lucene 4.0, you should either re-index or upgrade your index using IndexUpgrader. Note that the new scoring models all use the same single-byte norms format, so you can freely switch between them without re-indexing.
In addition to what's stored in the index, there are also these statistics available per-field, per-document while indexing, in the FieldInvertState passed to the Similarity.computeNorm method for both 3.x and 4.0:
- length - How many tokens in the document. For document 1 it's 7; for document 2 it's 4.
- uniqueTermCount - For this field in this document, how many unique terms are there? For document 1, it's 5; for document 2 it's 4.
- maxTermFrequency - What was the count for the most frequent term in this document. For document 1 it's 3 ("the" occurs 3 times); for document 2 it's 1.
From these available statistics you're now free to derive other commonly used statistics:
- Average field length across all documents is Terms.getSumTotalTermFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).
- Average within-document field term frequency is FieldInvertState.length divided by FieldInvertState.uniqueTermCount.
- Average number of unique terms per field across all documents is Terms.getSumDocFreq() divided by maxDoc (or Terms.getDocCount(field), if not all documents have the field).
Monday, March 5, 2012
Crowd-data can find drug and vaccine side effects
I suspect crowd-data will similarly become an effective way to create large, realistic databases.
A great application of this is the medical world, where many people post to health forums raising medical problems, possible side effects from drugs and vaccines, etc. Why not collect all such posts to find previously undiscovered problems? In fact, this paper describes just that: the authors extracted the nasty side effects of statin drugs based on posts to online health forums. Similarly, this abstract describes a system that used crowd-data to spot nasty side effects from Singulair, years before the FDA issued a warning. The VAERS database, which gathers parent-reported problems after children receive vaccines, is another example.
Unfortunately the drug safety trials that take place before a drug can be released are not especially trustworthy. Here's a scary quote from that interview:
When you look at the highest quality medical studies, the odds that a study will favor the use of a new drug are 5.3 times higher for commercially funded studies than for noncommercially funded studies.
And that was 7 years ago! I imagine the situation has only gotten worse.
When a new drug is released, the true, unbiased drug trial begins when millions of guinea-pigs start taking it. Crowd-data makes it possible to draw conclusions from that post-market drug trial.
Of course there are challenging tradeoffs: crowd-data, being derived from "ordinary people" without any rigorous standard collection process, can be dirty, incomplete and reflect sampling bias (only people experiencing nasty side effects speak up). For these reasons, old-fashioned journals turn their noses up at papers drawing conclusions from crowd-data.
Nevertheless, I believe such limitations are more than offset by the real-time nature and sheer scale: millions of people, constantly posting information over time. Inevitably, trustworthy patterns will emerge over the noise. Unlike the synthetic drug trial, this data is as real as you can get: sure, perhaps the drug seemed fine in the carefully controlled pre-market testing, but then out in the real world, unexpected interactions can suddenly emerge. Crowd-data will enable us to find such cases quickly and reliably, as long as we still have enough willing guinea-pigs!
Fast forward a few years and I expect crowd-data will be an excellent means of drawing conclusions, and will prove more reliable than the company-funded pre-market drug trials.
Thursday, March 1, 2012
Transactional Lucene
- Atomicity: when you make changes (adding, removing documents) in an IndexWriter session, and then commit, either all (if the commit succeeds) or none (if the commit fails) of your changes will be visible, never something in-between. Some methods have their own atomic behavior: if you call updateDocument, which is implemented as a delete followed by an add, you'll never see the delete without the add, even if you open a near-real-time (NRT) reader or commit from a separate thread. Similarly if you add a block of documents, using the relatively new addDocuments method, you'll see either none or all of the documents in any reader you obtain.
- Consistency: if the computer or OS crashes, or the JVM
crashes or is killed, or power is lost, your index will remain intact
(ie, not corrupt). Note that other problems, such as bad RAM, a
bit-flipping CPU or file system corruption, can still easily corrupt
the index!
- Isolation: while IndexWriter is making changes, nothing is visible to any IndexReader searching the index, until you commit or open a new NRT reader. Only one IndexWriter instance at a time can change the index.
- Durability: once commit returns, all changes have been written to durable storage (assuming your I/O system correctly implements fsync). If the computer or OS crashes, or the JVM crashes or is killed, or power is lost to the computer, all changes will still be present in the index.
Lucene provides a two-phased commit API: call the prepareCommit method to do all of the hard work (applying buffered deletes, writing buffered documents, fsyncing files). If something is going to go wrong (e.g., disk fills up) it'll almost certainly happen during this first phase. Then, call commit to complete the transaction.
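A minimal sketch of that two-phased flow (the error handling is simplified; in a real distributed commit you would prepare your other transactional resources between the two calls):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;

// Sketch: two-phased commit with IndexWriter.
void twoPhaseCommit(IndexWriter writer) throws IOException {
  try {
    writer.prepareCommit();  // phase 1: applies deletes, flushes, fsyncs; most likely place to fail
    // ... prepare any other transactional resources here ...
    writer.commit();         // phase 2: completes the transaction
  } catch (IOException e) {
    writer.rollback();       // discard everything since the last commit (also closes the writer)
    throw e;
  }
}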
When you close the IndexWriter, it calls commit under the hood. If, instead, you want to discard all changes since the last commit, call the rollback method instead, which also closes the writer. You can even rollback a CREATE: if you have an existing index, and you open an IndexWriter on it with OpenMode.CREATE, and then rollback, the index will be unchanged. Likewise, if you call deleteAll and then rollback.
Note that merely opening an IndexWriter on a new directory does not create an empty commit; i.e., you cannot open an IndexReader on the directory until you've called commit yourself.
Lucene does not implement a transaction log itself, but it's easy to build that layer on top; popular search servers such as Solr and ElasticSearch do so.
Multiple commits in one index
A single Lucene index is free to contain more than one commit; this is
a powerful yet often overlooked feature. Each commit holds a
point-in-time view of the index as it existed when the commit was
created.
This is similar to the snapshots
and writable clones available in modern filesystems like
ZFS and the
up-and-coming Btrfs. In
fact, Lucene is able to efficiently expose multiple commits for the
very same underlying reason: all index segments and files are write-once, just
like the file blocks in ZFS and Btrfs.
To save multiple commits in your index, just implement your own IndexDeletionPolicy and pass it to IndexWriter. This is the class Lucene uses to know which commits should be deleted: IndexWriter invokes it on opening an index and whenever a commit succeeds. The default policy, KeepOnlyLastCommitDeletionPolicy, deletes all but the last commit. If you use NoDeletionPolicy then every commit is retained!
You can pass a userData (Map&lt;String,String&gt;) to commit, to record custom information (opaque to Lucene) about that commit, and then use IndexReader.listCommits to find all commits in the index. Once you've found a commit, you can open an IndexReader on it to search the index as of that commit.
You can also open an IndexWriter on a prior commit, to effectively roll back all changes after it: this is just like the rollback method, except it enables you to rollback across commits and not just the changes made in the current IndexWriter session.
Old commits are still kept even when you open an index with OpenMode.CREATE. It's also fine to pass OpenMode.CREATE when IndexReaders are still searching the old commits. This enables fun use cases, such as fully re-indexing your content between each commit without affecting any open readers.
Combining all of these fun transactional features, you can do some cool things:
- Hot backups, using SnapshotDeletionPolicy or PersistentSnapshotDeletionPolicy: these deletion policies make it trivial to take a "live" backup of the index without blocking ongoing changes with IndexWriter. The backup can easily be incremental (just copy the new files, remove the deleted ones), and you can freely throttle the IO to minimize any interference with searching.
- Searching different catalog versions: perhaps you run an e-commerce site, and you ship multiple versions of your catalog. In this case you can keep older commits around, each searching a specific version of your catalog, enabling users to choose which catalog to search.
- Repeatable indexing tests from the same initial index: maybe you want to run a bunch of performance tests, perhaps trying different RAM buffer sizes or merge factors, starting from a large initial index. To do this, simply run each test, but in the end, instead of closing the IndexWriter, use the rollback method to quickly return the index to its initial state, ready for the next test.
- Force all index segments to be merged down to a single segment, but
also keep the prior multi-segment commit. Then you can do
tests to compare multi-segment vs single-segment performance.
- Indexing and searching over the NFS file system: because NFS does not protect still-open files from deletion, you must use an IndexDeletionPolicy to keep each commit around until all open readers have finished with the commit (ie, reopened to a newer commit). The simple approach is time-based, for example: don't delete the commit until it is 15 minutes old, and then always reopen your readers every 5 minutes. Without this you'll hit all sorts of scary exceptions when searching over NFS.
- Distributed commit: if you have other resources that must commit atomically along with the changes to your Lucene index, you can use the two-phased commit API. This is simple, but vulnerable to failures during the 2nd phase; to also recover from such cases, for example if Lucene completed its 2nd phase commit but the database's 2nd phase hit some error or crash or power loss, you can easily rollback Lucene's commit by opening an IndexWriter on the prior commit.
- Experimental index changes: maybe you want to try re-indexing some
subset of your index in a new way, but you're not sure it'll work
out. In this case, just keep the old commit around, and then
rollback if it didn't work out, or delete the old commit if it did.
- Time-based snapshots: maybe you'd like the freedom to roll back to your index as it existed 1 day ago, 1 week ago, 1 month ago, etc., so you preserve commits based on their age.
Saturday, January 14, 2012
ToChildBlockJoinQuery in Lucene
BlockJoinQuery has a limitation: it joins in only one direction (from child to parent documents). This can be a problem because some applications need to join in reverse (from parent to child documents) instead.
This is now fixed! I just committed a new query, ToChildBlockJoinQuery, to perform the join in the opposite direction. I also renamed the previous query to ToParentBlockJoinQuery.
You use it just like BlockJoinQuery, except in reverse: it wraps any other Query matching parent documents and translates it into a Query matching child documents. The resulting Query can then be combined with other queries against fields in the child documents, and you can then sort by child fields as well.
Using songs and albums as an example: imagine you index each song (child) and album (parent) as separate documents in a single document block. With ToChildBlockJoinQuery, you can now run queries like:

- albumName:thunder AND songName:numbor
- albumName:thunder, sort by songTitle

Any query with constraints against album and/or song fields will work, and the returned hits will be individual songs (not grouped).
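A rough sketch of what that looks like in code. It assumes the ToChildBlockJoinQuery constructor mirrors the parent-join query (parent query, parent filter, score flag), and the "type" marker field is hypothetical; check the exact signature for your release:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.ToChildBlockJoinQuery;

// Sketch: match albums by name, then join down to individual songs.
Filter albums = new CachingWrapperFilter(
    new QueryWrapperFilter(
        new TermQuery(new Term("type", "album"))));  // hypothetical parent marker field

Query albumQuery = new TermQuery(new Term("albumName", "thunder"));
ToChildBlockJoinQuery joinToSongs =
    new ToChildBlockJoinQuery(albumQuery, albums, false /* doScores */);

// Combine with constraints on song (child) fields; the hits are songs:
BooleanQuery query = new BooleanQuery();
query.add(joinToSongs, Occur.MUST);
query.add(new TermQuery(new Term("songName", "numbor")), Occur.MUST);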
ToChildBlockJoinQuery will be available in Lucene 3.6.0 and 4.0.
Sunday, January 8, 2012
Searching relational content with Lucene's BlockJoinQuery
Most search engines can't directly index relational content, as documents in the index logically behave like a single flat database table. Yet, relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have separate list of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files.
Perhaps the PDF documents you need to search are immense, so you break them up and index each section as a separate Lucene document; in this case you'll have common fields (title, abstract, author, date published, etc.) for the overall document, joined to the sub-document (section) with its own fields (text, page number, etc.). XML documents typically contain nested tags, representing joined sub-documents; emails have attachments; office documents can embed other documents. Nearly all search domains have some form of relational content, often requiring more than one join.
If such content is so common then how do search applications handle it today?
One obvious "solution" is to simply use a relational database instead of a search engine! If relevance scores are less important and you need to do substantial joining, grouping, sorting, etc., then using a database could be best overall. Most databases include some form a text search, some even using Lucene.
If you still want to use a search engine, then one common approach is to denormalize the content up front, at index-time, by joining all tables and indexing the resulting rows, duplicating content in the process. For example, you'd index each song as a Lucene document, copying over all fields from the song's joined album and artist/band. This works correctly, but can be horribly wasteful as you are indexing identical fields, possibly including large text fields, over and over.
Another approach is to do the join yourself, outside of Lucene, by indexing songs, albums and artist/band as separate Lucene documents, perhaps even in separate indices. At search-time, you first run a query against one collection, for example the songs. Then you iterate through all hits, gathering up (joining) the full set of corresponding albums and then run a second query against the albums, with a large OR'd list of the albums from the first query, repeating this process if you need to join to artist/band as well. This approach will also work, but doesn't scale well as you may have to create possibly immense follow-on queries.
Yet another approach is to use a software package that has already implemented one of these approaches for you! elasticsearch, Apache Solr, Apache Jackrabbit, Hibernate Search and many others all handle relational content in some way.
With BlockJoinQuery you can now directly search relational content yourself!
Let's work through a simple example: imagine you sell shirts online. Each shirt has certain common fields such as name, description, fabric, price, etc. For each shirt you have a number of separate stock keeping units or SKUs, which have their own fields like size, color, inventory count, etc. The SKUs are what you actually sell, and what you must stock, because when someone buys a shirt they buy a specific SKU (size and color).
Maybe you are lucky enough to sell the incredible Mountain Three-wolf Moon Short Sleeve Tee, with these SKUs (size, color):
- small, blue
- small, black
- medium, black
- large, gray
Say the user drills down on a small blue shirt, resulting in this query:

name:wolf AND size=small AND color=blue

which should match this shirt. Here, name is a shirt field while the size and color are SKU fields.
But if the user drills down instead on a small gray shirt:

name:wolf AND size=small AND color=gray

then this shirt should not match because the small size only comes in blue and black.
How can you run these queries using BlockJoinQuery? Start by indexing each shirt (parent) and all of its SKUs (children) as separate documents, using the new IndexWriter.addDocuments API to add one shirt and all of its SKUs as a single document block. This method atomically adds a block of documents into a single segment as adjacent document IDs, which BlockJoinQuery relies on. You should also add a marker field to each shirt document (e.g. type = shirt), as BlockJoinQuery requires a Filter identifying the parent documents.
To run a BlockJoinQuery at search-time, you'll first need to create the parent filter, matching only shirts. Note that the filter must use FixedBitSet under the hood, like CachingWrapperFilter:

Filter shirts = new CachingWrapperFilter(
    new QueryWrapperFilter(
        new TermQuery(
            new Term("type", "shirt"))));

Create this filter once, up front, and re-use it any time you need to perform this join.
Then, for each query that requires a join, because it involves both SKU and shirt fields, start with the child query matching only SKU fields:

BooleanQuery skuQuery = new BooleanQuery();
skuQuery.add(new TermQuery(new Term("size", "small")), Occur.MUST);
skuQuery.add(new TermQuery(new Term("color", "blue")), Occur.MUST);

Next, use BlockJoinQuery to translate hits from the SKU document space up to the shirt document space:

BlockJoinQuery skuJoinQuery = new BlockJoinQuery(
    skuQuery, shirts, ScoreMode.None);

The ScoreMode enum decides how scores for multiple SKU hits should be aggregated to the score for the corresponding shirt hit. In this query you don't need scores from the SKU matches, but if you did you can aggregate with Avg, Max or Total instead.
Finally you are now free to build up an arbitrary shirt query using skuJoinQuery as a clause:

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("name", "wolf")), Occur.MUST);
query.add(skuJoinQuery, Occur.MUST);

You could also just run skuJoinQuery as-is if the query doesn't have any shirt fields.
Finally, just run this query like normal! The returned hits will be only shirt documents; if you'd also like to see which SKUs matched for each shirt, use BlockJoinCollector:

BlockJoinCollector c = new BlockJoinCollector(
    Sort.RELEVANCE, // sort
    10,             // numHits
    true,           // trackScores
    false           // trackMaxScore
    );
searcher.search(query, c);

The provided Sort must use only shirt fields (you cannot sort by any SKU fields). When each hit (a shirt) is competitive, this collector will also record all SKUs that matched for that shirt, which you can retrieve like this:

TopGroups hits = c.getTopGroups(
    skuJoinQuery,
    skuSort,
    0,    // offset
    10,   // maxDocsPerGroup
    0,    // withinGroupOffset
    true  // fillSortFields
    );
Set skuSort to the sort order for the SKUs within each shirt. The first offset hits are skipped (use this for paging through shirt hits). Under each shirt, at most maxDocsPerGroup SKUs will be returned. Use withinGroupOffset if you want to page within the SKUs. If fillSortFields is true then each SKU hit will have values for the fields from skuSort.
The hits returned by BlockJoinCollector.getTopGroups are SKU hits, grouped by shirt. You'd get the exact same results if you had denormalized up-front and then used grouping to group results by shirt.
You can also do more than one join in a single query; the joins can be nested (parent to child to grandchild) or parallel (parent to child1 and parent to child2).
However, there are some important limitations of index-time joins:
- The join must be computed at index-time and "compiled" into the
index, in that all joined child documents must be indexed along
with the parent document, as a single document block.
- Different document types (for example, shirts and SKUs) must share a single index, which is wasteful as it means non-sparse data structures like FieldCache entries consume more memory than they would if you had separate indices.
- If you need to re-index a parent document or any of its child
documents, or delete or add a child, then the entire block must be
re-indexed. This is a big problem in some cases, for example if
you index "user reviews" as child documents then whenever a user
adds a review you'll have to re-index that shirt as well as all
its SKUs and user reviews.
- There is no QueryParser support, so you need to programmatically create the parent and child queries, separating according to parent and child fields.
- The join can currently only go in one direction (mapping child
docIDs to parent docIDs), but in some cases you need to map parent
docIDs to child docIDs. For example, when searching songs,
perhaps you want all matching songs sorted by their title. You
can't easily do this today because the only way to get song hits
is to group by album or band/artist.
- The join is a one (parent) to many (children), inner join.
There is work underway to create a more flexible, but likely less performant, query-time join capability, which should address a number of the above limitations.
Thursday, November 10, 2011
SearcherLifetimeManager prevents a broken search user experience
In the past, search indices were usually very static: you built them once, called optimize at the end and shipped them off, and didn't change them very often.

But these days it's just the opposite: most applications have very dynamic indices, constantly being updated with a stream of changes, and you never call optimize anymore.
Lucene's near-real-time search, especially with recent improvements including manager classes to handle the tricky complexities of sharing searchers across threads, offers very fast search turnaround on index changes.
But there is a serious yet often overlooked problem with this approach. To see it, you have to put yourself in the shoes of a user. Imagine Alice comes to your site, runs a search, and is looking through the search results. Not satisfied, after a few seconds she decides to refine that first search. Perhaps she drills down on one of the nice facets you presented, or maybe she clicks to the next page, or picks a different sort criteria (any follow-on action will do). So a new search request is sent back to your server, including the first search plus the requested change (drill down, next page, change sort field, etc.).
How do you handle this follow-on search request? Just pull the latest and greatest searcher from your SearcherManager or NRTManager and search away, right?
Wrong!
If you do this, you risk a broken search experience for Alice, because the new searcher may be different from the original searcher used for Alice's first search request. The differences could be substantial, if you had just opened a new searcher after updating a bunch of documents. This means the results of Alice's follow-on search may have shifted: facet counts are now off, hits are sorted differently so some hits may be duplicated on the second page, or may be lost (if they moved from page 2 to page 1), etc. If you use the new (will be in Lucene 3.5.0) searchAfter API, for efficient paging, the risk is even greater!
Perversely, the frequent searcher reopening that you thought provides such a great user experience by making all search results so fresh, can in fact have just the opposite effect. Each reopen risks breaking all current searches in your application; the more active your site, the more searches you might break!
It's deadly to intentionally break a user's search experience: they will (correctly) conclude your search is buggy, eroding their trust, and then take their business to your competition.
It turns out, this is easy to fix! Instead of pulling the latest searcher for every incoming search request, you should try to pull the same searcher used for the initial search request in the session. This way all follow-on searches see exactly the same index.
Fortunately, there's a new class coming in Lucene 3.5.0 that simplifies this: SearcherLifetimeManager. The class is agnostic to how you obtain the fresh searchers (i.e., SearcherManager, NRTManager, or your own custom source) used for an initial search.

Just like Lucene's other manager classes, SearcherLifetimeManager is very easy to use. Create the manager once, up front:

SearcherLifetimeManager mgr = new SearcherLifetimeManager();

Then, when a search request arrives, if it's an initial (not follow-on) search, obtain the most current searcher in the usual way, but then record this searcher:
long token = mgr.record(searcher);

The returned token uniquely identifies the specific searcher; you must save it somewhere along with the user's search results, for example by placing it in a hidden HTML form field.
Later, when the user performs a follow-on search request, make sure the original token is sent back to the server, and then use it to obtain the same searcher:

// If possible, obtain same searcher version as last
// search:
IndexSearcher searcher = mgr.acquire(token);
if (searcher != null) {
  // Searcher is still here
  try {
    // do searching...
  } finally {
    mgr.release(searcher);
    // Do not use searcher after this!
    searcher = null;
  }
} else {
  // Searcher was pruned -- notify user session timed
  // out
}

As long as the original searcher is still available, the manager will return it to you; be sure to release that searcher (ideally in a finally clause).
It's possible searcher is no longer available: for example if Alice ran a new search, but then got hungry, went off to a long lunch, and finally returned then clicked "next page", likely the original searcher will have been pruned!
You should gracefully handle this case, for example by notifying Alice that the search had timed out and asking her to re-submit the original search (which will then get the latest and greatest searcher). Fortunately, you can reduce how often this happens, by controlling how aggressively you prune old searchers:
mgr.prune(new PruneByAge(600.0));

This removes any searchers older than 10 minutes (you can also implement a custom pruning strategy). You should call it from a separate dedicated thread (not a searcher thread), ideally the same thread that's periodically indexing changes and opening new searchers.
Keeping many searchers around will necessarily tie up resources (open file descriptors, RAM, index files on disk that the IndexWriter would otherwise have deleted). However, because the reopened searchers share sub-readers, the resource consumption will generally be well contained, in proportion to how many index changes occurred between each reopen. Just be sure to use NRTCachingDirectory, to ensure you don't bump up against open file descriptor limits on your operating system (this also gives a good speedup in reopen turnaround time).
Don't erode your users' trust by intentionally breaking their searches!
LUCENE-3486 has the details.
Thursday, November 3, 2011
Near-real-time readers with Lucene's SearcherManager and NRTManager
The SearcherManager class, coming in the next (3.5.0) Lucene release, makes it easy to periodically reopen your IndexSearcher when multiple threads need to share it. This class presents a very simple acquire/release API, hiding the thread-safe complexities of opening and closing the underlying IndexReaders.
But that example used a non near-real-time (NRT) IndexReader, which has relatively high turnaround time for index changes to become visible, since you must call IndexWriter.commit first.
If you have access to the IndexWriter that's actively changing the index (i.e., it's in the same JVM as your searchers), use an NRT reader instead! NRT readers let you decouple durability to hardware/OS crashes from visibility of changes to a new IndexReader. How frequently you commit (for durability) and how frequently you reopen (to see new changes) become fully separate decisions.
This controlled consistency model that Lucene exposes is a nice "best of both worlds" blend between the traditional immediate and eventual consistency models.
Since reopening an NRT reader bypasses the costly commit, and shares some data structures directly in RAM instead of writing/reading to/from files, it provides extremely fast turnaround time on making index changes visible to searchers. Frequent reopens such as every 50 milliseconds, even under relatively high indexing rates, is easily achievable on modern hardware.
Fortunately, it's trivial to use SearcherManager with NRT readers: use the constructor that takes IndexWriter instead of Directory:

boolean applyAllDeletes = true;
ExecutorService es = null;
SearcherManager mgr = new SearcherManager(writer, applyAllDeletes,
                                          new MySearchWarmer(), es);

This tells SearcherManager that its source for new IndexReaders is the provided IndexWriter instance (instead of a Directory instance). After that, use the SearcherManager just as before.
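For reference, a minimal sketch of the acquire/release pattern you would then use with that manager; it assumes the 3.5-era acquire(), release() and maybeReopen() method names, so check them against your release:

// Sketch: share the NRT searcher across threads.
IndexSearcher s = mgr.acquire();
try {
  // ... run searches against s ...
} finally {
  mgr.release(s);   // never use s after this
}

// From a separate, periodic reopen thread:
mgr.maybeReopen();  // opens a fresh NRT reader if the index has changed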
Typically you'll set the applyAllDeletes boolean to true, meaning each reopened reader is required to apply all previous deletion operations (deleteDocuments or updateDocument/s) up until that point.
Sometimes your usage won't require deletions to be applied. For example, perhaps you index multiple versions of each document over time, always deleting the older versions, yet during searching you have some way to ignore the old versions. If that's the case, you can pass applyAllDeletes=false instead. This will make the turnaround time quite a bit faster, as the primary-key lookups required to resolve deletes can be costly. However, if you're using Lucene's trunk (to be eventually released as 4.0), another option is to use MemoryCodec on your id field to greatly reduce the primary-key lookup time.
Note that some or even all of the previous deletes may still be applied even if you pass false. Also, the pending deletes are never lost if you pass false: they remain buffered and will still eventually be applied.

If you have some searches that can tolerate unapplied deletes and others that cannot, it's perfectly fine to create two SearcherManagers, one applying deletes and one not.
If you pass a non-null ExecutorService, then each segment in the index can be searched concurrently; this is a way to gain concurrency within a single search request. Most applications do not require this, because the concurrency across multiple searches is sufficient. It's also not clear that this is effective in general as it adds per-segment overhead, and the available concurrency is a function of your index structure. Perversely, a fully optimized index will have no concurrency! Most applications should pass null.
NRTManager
What if you want the fast turnaround time of NRT readers, but need control over when specific index changes become visible to certain searches? Use NRTManager!
NRTManager holds onto the IndexWriter instance you provide and then exposes the same APIs for making index changes (addDocument/s, updateDocument/s, deleteDocuments). These methods forward to the underlying IndexWriter, but then return a generation token (a Java long) which you can hold onto after making any given change. The generation only increases over time, so if you make a group of changes, just keep the generation returned from the last change you made.
Then, when a given search request requires certain changes to be visible, pass that generation back to NRTManager to obtain a searcher that's guaranteed to reflect all changes for that generation.

Here's one example use-case: let's say your site has a forum, and you use Lucene to index and search all posts in the forum. Suddenly a user, Alice, comes online and adds a new post; in your server, you take the text from Alice's post and add it as a document to the index, using NRTManager.addDocument, saving the returned generation. If she adds multiple posts, just keep the last generation.
Now, if Alice stops posting and runs a search, you'd like to ensure her search covers all the posts she just made. Of course, if your reopen time is fast enough (say once per second), unless Alice types very quickly, any search she runs will already reflect her posts.
But pretend for now you reopen relatively infrequently (say once every 5 or 10 seconds), and you need to be certain Alice's search covers her posts, so you call NRTManager.waitForGeneration to obtain the SearcherManager to use for searching. If the latest searcher already covers the requested generation, the method returns immediately. Otherwise, it blocks, requesting a reopen (see below), until the required generation has become visible in a searcher, and then returns it.
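Pulling that together, here is a rough sketch of the Alice flow; the method shapes (addDocument returning a long generation, waitForGeneration returning a SearcherManager) follow the description above, and alicesNewPosts and nrtManager are placeholders, so treat the exact signatures as assumptions against your 3.5 release:

// Indexing thread: add Alice's posts and remember the last generation.
long gen = 0;
for (Document post : alicesNewPosts) {
  gen = nrtManager.addDocument(post);  // forwards to IndexWriter, returns a generation
}

// Search thread handling Alice's query: wait until her changes are visible.
SearcherManager mgr = nrtManager.waitForGeneration(gen);
IndexSearcher s = mgr.acquire();
try {
  // run Alice's search; it is guaranteed to see her new posts
} finally {
  mgr.release(s);
}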
If some other user, say Bob, doesn't add any posts and runs a search, you don't need to wait for Alice's generation to be visible when obtaining the searcher, since it's far less important when Alice's changes become immediately visible to Bob. There's (usually!) no causal connection between Alice posting and Bob searching, so it's fine for Bob to use the most recent searcher.
Another use-case is an index verifier, where you index a document and then immediately search for it to perform end-to-end validation that the document "made it" correctly into the index. That immediate search must first wait for the returned generation to become available.
The power of NRTManager is you have full control over which searches must see the effects of which indexing changes; this is a further improvement in Lucene's controlled consistency model. NRTManager hides all the tricky details of tracking generations.
But: don't abuse this! You may be tempted to always wait for the last generation you indexed for all searches, but this would result in very low search throughput on concurrent hardware since all searches would bunch up, waiting for reopens. With proper usage, only a small subset of searches should need to wait for a specific generation, like Alice; the rest will simply use the most recent searcher, like Bob.
Managing reopens is a little trickier with NRTManager, since you should reopen at higher frequency whenever a search is waiting for a specific generation. To address this, there's the useful NRTManagerReopenThread class; use it like this:

double minStaleSec = 0.025;
double maxStaleSec = 5.0;
NRTManagerReopenThread thread = new NRTManagerReopenThread(
    nrtManager, maxStaleSec, minStaleSec);
thread.start();
...
thread.close();

The minStaleSec sets an upper bound on the time a user must wait before the search can run. This is used whenever a searcher is waiting for a specific generation (Alice, above), meaning the longest such a search should have to wait is approximately 25 msec.
The maxStaleSec sets a lower bound on how frequently reopens should occur. This is used for the periodic "ordinary" reopens, when there is no request waiting for a specific generation (Bob, above); this means any changes done to the index more than approximately 5.0 seconds ago will be seen when Bob searches. Note that these parameters are approximate targets and not hard guarantees on the reader turnaround time. Be sure to eventually call thread.close(), when you are done reopening (for example, on shutting down the application).
You are also free to use your own strategy for calling maybeReopen; you don't have to use NRTManagerReopenThread. Just remember that getting it right, especially when searches are waiting for specific generations, can be tricky!
Tuesday, October 25, 2011
Accuracy and performance of Google's Compact Language Detector
- Apache Tika, implemented in Java, using its LanguageIdentification class
- language-detection, a project on Google code, also implemented in Java
For the test corpus I used the corpus described here, created by the author of language-detection. It contains 1000 texts from each of 21 languages, randomly sampled from the Europarl corpus.
It's not a perfect test (no test ever is!): the content is already very clean plain text; there are no domain, language, encoding hints to apply (which you'd normally have with HTML content loaded over HTTP); it "only" covers 21 languages (versus at least 76 that CLD can detect).
CLD and language-detection cover all 21 languages, but Tika is missing Bulgarian (bg), Czech (cs), Lithuanian (lt) and Latvian (lv), so I only tested on the remaining subset of 17 languages that all three detectors support. This works out to 17,000 texts totalling 2.8 MB.
Many of the texts are very short, making the test challenging: the shortest is 25 bytes, and 290 (1.7%) of the 17000 are 30 bytes or less.
In addition to the challenges of the corpora, the differences in the detectors make the comparison somewhat apples to oranges. For example, CLD detects at least 76 languages, while language-detection detects 53 and Tika detects 27, so this biases against CLD, and language-detection to a lesser extent, since their classification task is harder relative to Tika's.
For CLD, I disabled its option to abstain (removeWeakMatches), so that it always guesses at the language even when confidence is low, to match the other two detectors. I also turned off pickSummaryLanguage, as this was also hurting accuracy; now CLD simply picks the highest scoring match as the detected language.
For language-detection, I ran with the default ALPHA of 0.5, and set the random seed to 0.
Here are the raw results:
CLD results (total 98.82% = 16800 / 17000):
da | 93.4% | da=934 | nb=54 | sv=5 | fr=2 | eu=2 | is=1 | hr=1 | en=1 | |
de | 99.6% | de=996 | en=2 | ga=1 | cy=1 | |||||
el | 100.0% | el=1000 | ||||||||
en | 100.0% | en=1000 | ||||||||
es | 98.3% | es=983 | pt=4 | gl=3 | en=3 | it=2 | eu=2 | id=1 | fi=1 | da=1 |
et | 99.6% | et=996 | ro=1 | id=1 | fi=1 | en=1 | ||||
fi | 100.0% | fi=1000 | ||||||||
fr | 99.2% | fr=992 | en=4 | sq=2 | de=1 | ca=1 | ||||
hu | 99.9% | hu=999 | it=1 | |||||||
it | 99.5% | it=995 | ro=1 | mt=1 | id=1 | fr=1 | eu=1 | |||
nl | 99.5% | nl=995 | af=3 | sv=1 | et=1 | |||||
pl | 99.6% | pl=996 | tr=1 | sw=1 | nb=1 | en=1 | ||||
pt | 98.7% | pt=987 | gl=4 | es=3 | mt=1 | it=1 | is=1 | ht=1 | fi=1 | en=1 |
ro | 99.8% | ro=998 | da=1 | ca=1 | ||||||
sk | 98.8% | sk=988 | cs=9 | en=2 | de=1 | |||||
sl | 95.1% | sl=951 | hr=32 | sr=8 | sk=5 | en=2 | id=1 | cs=1 | ||
sv | 99.0% | sv=990 | nb=9 | en=1 |
Tika results (total 97.12% = 16510 / 17000):
da | 87.6% | da=876 | no=112 | nl=4 | sv=3 | it=1 | fr=1 | et=1 | en=1 | de=1 | ||||
de | 98.5% | de=985 | nl=3 | it=3 | da=3 | sv=2 | fr=2 | sl=1 | ca=1 | |||||
el | 100.0% | el=1000 | ||||||||||||
en | 96.9% | en=969 | no=10 | it=6 | ro=4 | sk=3 | fr=3 | hu=2 | et=2 | sv=1 | ||||
es | 89.8% | es=898 | gl=47 | pt=22 | ca=15 | it=6 | eo=4 | fr=3 | fi=2 | sk=1 | nl=1 | et=1 | ||
et | 99.1% | et=991 | fi=4 | fr=2 | sl=1 | no=1 | ca=1 | |||||||
fi | 99.4% | fi=994 | et=5 | hu=1 | ||||||||||
fr | 98.0% | fr=980 | sl=6 | eo=3 | et=2 | sk=1 | ro=1 | no=1 | it=1 | gl=1 | fi=1 | es=1 | de=1 | ca=1 |
hu | 99.9% | hu=999 | ca=1 | |||||||||||
it | 99.4% | it=994 | eo=4 | pt=1 | fr=1 | |||||||||
nl | 97.8% | nl=978 | no=8 | de=3 | da=3 | sl=2 | ro=2 | pl=1 | it=1 | gl=1 | et=1 | |||
pl | 99.1% | pl=991 | sl=3 | sk=2 | ro=1 | it=1 | hu=1 | fi=1 | ||||||
pt | 94.4% | pt=944 | gl=48 | hu=2 | ca=2 | it=1 | et=1 | es=1 | en=1 | |||||
ro | 99.3% | ro=993 | is=2 | sl=1 | pl=1 | it=1 | hu=1 | fr=1 | ||||||
sk | 96.2% | sk=962 | sl=21 | pl=13 | it=2 | ro=1 | et=1 | |||||||
sl | 98.5% | sl=985 | sk=7 | et=4 | it=2 | pt=1 | no=1 | |||||||
sv | 97.1% | sv=971 | no=15 | nl=6 | da=6 | de=1 | ca=1 |
Language-detection results (total 99.22% = 16868 / 17000):
da | 97.1% | da=971 | no=28 | en=1 | |||
de | 99.8% | de=998 | da=1 | af=1 | |||
el | 100.0% | el=1000 | |||||
en | 99.7% | en=997 | nl=1 | fr=1 | af=1 | ||
es | 99.5% | es=995 | pt=4 | en=1 | |||
et | 99.6% | et=996 | fi=2 | de=1 | af=1 | ||
fi | 99.8% | fi=998 | et=2 | ||||
fr | 99.8% | fr=998 | sv=1 | it=1 | |||
hu | 99.9% | hu=999 | id=1 | ||||
it | 99.8% | it=998 | es=2 | ||||
nl | 97.7% | nl=977 | af=21 | sv=1 | de=1 | ||
pl | 99.9% | pl=999 | nl=1 | ||||
pt | 99.4% | pt=994 | es=3 | it=1 | hu=1 | en=1 | |
ro | 99.9% | ro=999 | fr=1 | ||||
sk | 98.7% | sk=987 | cs=8 | sl=2 | ro=1 | lt=1 | et=1 |
sl | 97.2% | sl=972 | hr=27 | en=1 | |||
sv | 99.0% | sv=990 | no=8 | da=2 |
Some quick analysis:
- The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the tests are!
- The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and language-detection). Tika in particular has trouble with Spanish (confuses it with Galician). These confusions are to be expected: the languages are very similar.
When language-detection was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time. These numbers are quite low! It tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it's not the case that they are all always wrong on the short texts.
This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts). This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.
You could also make a simple majority-rules voting system across these (and other) libraries. I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with the language-detection choice. This gives the best accuracy of all: total 99.59% (= 16930 / 17000)!
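A sketch of that voting rule; the three detector outputs are stand-in strings, produced however you call the respective libraries:

// Sketch: majority vote across three detectors, falling back to language-detection.
String vote(String cldLang, String tikaLang, String langDetectLang) {
  if (cldLang.equals(tikaLang) || cldLang.equals(langDetectLang)) {
    return cldLang;        // CLD agrees with at least one other detector (2+ votes)
  }
  if (tikaLang.equals(langDetectLang)) {
    return tikaLang;       // Tika and language-detection agree (2 votes)
  }
  return langDetectLang;   // no majority: trust the most accurate single detector
}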
Finally, I also separately tested the run time for each package. Each time is the best of 10 runs through the full corpus:
CLD | 171 msec | 16.331 MB/sec |
language-detection | 2367 msec | 1.180 MB/sec |
Tika | 42219 msec | 0.066 MB/sec |
CLD is incredibly fast! language-detection is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).

I used the 09-13-2011 release of language-detection, the current trunk (svn revision 1187915) of Apache Tika, and the current trunk (hg revision b0adee43f3b1) of CLD.
All sources for the performance tests are available from here.
Monday, October 24, 2011
Additions to Compact Language Detector API
I've made some small improvements after my quick initial port of Google's Compact Language Detection Library, starting with some helpful Python constants:
- cld.ENCODINGS has all the encoding names recognized by CLD; if you pass the encoding hint it must be one of these.
- cld.LANGUAGES has the list of all base languages known (but not necessarily detectable) by CLD.
- cld.EXTERNAL_LANGUAGES has the list of external languages known (but not necessarily detectable) by CLD.
- cld.DETECTED_LANGUAGES has the list of detectable languages.
I haven't found a reliable way to get the full list of detectable languages; for now, I've started with all languages that are covered by the unit test, total count 75, which should be a lower bound on the true count.
I also exposed control over whether CLD should abstain from a given matched language if the confidence is too low, by adding a parameter removeWeakMatches (required in C and optional in Python, default False). Turn this option on if abstaining is OK in your use case, such as a browser toolbar offering to translate content. Turn it off when testing accuracy vs other language detection libraries (unless they also abstain!).

Finally, CLD has an algorithm that tries to pick the best "summary" language, and it doesn't always just pick the highest scoring match. For example, the code has this comment:
// If English and X, where X (not UNK) is big enough,
// assume the English is boilerplate and return X.

See the CalcSummaryLanguage function for more details!
I found this was hurting accuracy in testing so I added a parameter pickSummaryLanguage (default False) to turn this on or off.

Finally, I fixed the Python binding to release the GIL while CLD is running, so multiple threads can now detect without falsely blocking one another.