TokenStream is actually a chain, starting with a Tokenizer that splits characters into initial tokens, followed by any number of TokenFilters that modify the tokens. You can also use a CharFilter to pre-process the characters before tokenization, for example to strip out HTML markup or remap characters according to a regular expression, while preserving the proper offsets back into the original input string. Analyzer is the factory class that creates TokenStreams when needed.
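For illustration, here is a minimal sketch of such a chain, assuming Lucene 3.6-era APIs (the class name ChainAnalyzer is made up; a CharFilter, if used, would wrap the Reader before it reaches the Tokenizer):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Sketch: an Analyzer wiring a Tokenizer and one TokenFilter into a chain.
    public class ChainAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenizer: splits the raw characters into the initial tokens.
        TokenStream chain = new StandardTokenizer(Version.LUCENE_36, reader);
        // TokenFilter: modifies tokens produced by the stage before it.
        chain = new LowerCaseFilter(Version.LUCENE_36, chain);
        return chain;
      }
    }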
Lucene and Solr have a wide variety of Tokenizers and TokenFilters, including support for at least 34 languages.
Let's tokenize a simple example: fast wi fi network is down. Assume we preserve stop words. When viewed as a graph, the tokens look like this:
Each node is a position, and each arc is a token. The TokenStream enumerates a directed acyclic graph, one arc at a time.
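To make the iteration concrete, here is a rough sketch assuming Lucene 3.6-era APIs; the empty stop set is passed so that stop words like is survive, per the example above:

    import java.io.StringReader;
    import java.util.Collections;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.Version;

    public class WalkTokens {
      public static void main(String[] args) throws Exception {
        // Empty stop set: keep stop words such as "is".
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36, Collections.emptySet());
        TokenStream ts = analyzer.tokenStream("body", new StringReader("fast wi fi network is down"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {  // one arc per call
          System.out.println(term + " posIncr=" + posIncr.getPositionIncrement());
        }
        ts.end();
        ts.close();
      }
    }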
Next, let's add SynonymFilter into our analysis chain, applying this synonym rule:

    wi fi network → hotspot
Now the graph is more interesting! For each token (arc), the PositionIncrementAttribute tells us how many positions (nodes) ahead this arc starts from, while the new (as of 3.6.0) PositionLengthAttribute tells us how many positions (nodes) ahead the arc arrives at.
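Here is a sketch of installing that rule and printing both attributes, assuming the 3.6-era SynonymMap/SynonymFilter API (multi-word sides of a rule are joined with SynonymMap.WORD_SEPARATOR):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
    import org.apache.lucene.util.CharsRef;
    import org.apache.lucene.util.Version;

    public class SynonymGraphDemo {
      public static void main(String[] args) throws Exception {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);  // true: dedup rules
        // "wi fi network" → "hotspot", keeping the original tokens (last arg = true):
        builder.add(new CharsRef("wi" + SynonymMap.WORD_SEPARATOR + "fi" + SynonymMap.WORD_SEPARATOR + "network"),
                    new CharsRef("hotspot"), true);
        TokenStream ts = new SynonymFilter(
            new WhitespaceTokenizer(Version.LUCENE_36, new StringReader("fast wi fi network is down")),
            builder.build(), true);  // true: ignore case
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
        PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // hotspot should print posLen=3: its arc spans three positions.
          System.out.println(term + " posIncr=" + posIncr.getPositionIncrement()
              + " posLen=" + posLen.getPositionLength());
        }
        ts.end();
        ts.close();
      }
    }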
Besides SynonymFilter, several other analysis components now produce token graphs.
Kuromoji's JapaneseTokenizer outputs the decompounded form for compound tokens. For example, tokens like ショッピングセンター (shopping center) will also have an alternate path with ショッピング (shopping) followed by センター (center). Both ShingleFilter and CommonGramsFilter set the position length to 2 when they merge two input tokens.
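As a quick illustration of the ShingleFilter case, a sketch along the same lines as the examples above:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
    import org.apache.lucene.util.Version;

    public class ShingleDemo {
      public static void main(String[] args) throws Exception {
        TokenStream ts = new ShingleFilter(
            new WhitespaceTokenizer(Version.LUCENE_36, new StringReader("wi fi network")));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // Merged bigrams such as "wi fi" should print posLen=2; unigrams posLen=1.
          System.out.println(term + " posLen=" + posLen.getPositionLength());
        }
        ts.end();
        ts.close();
      }
    }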
Other analysis components should produce a graph but don't yet (patches welcome!): WordDelimiterFilter, DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter, NGramTokenFilter, EdgeNGramTokenFilter, and likely others.
Limitations

There are unfortunately several hard-to-fix problems with token graphs. One problem is that the indexer completely ignores PositionLengthAttribute; it only pays attention to PositionIncrementAttribute. This means the indexer acts as if all arcs always arrive at the very next position, so for the above graph we actually index this:
This means certain phrase queries should match but don't (e.g.: "hotspot is down"), and other phrase queries shouldn't match but do (e.g.: "fast hotspot fi"). Other cases do work correctly (e.g.: "fast hotspot"). We refer to this "lossy serialization" as sausagization, because the incoming graph is unexpectedly turned from a correct word lattice into an incorrect sausage. This limitation is challenging to fix: it requires changing the index format (and Codec APIs) to store an additional int position length per position, and then fixing positional queries to respect this value.
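Here is a rough end-to-end sketch of the sausagization effect, again assuming 3.6-era APIs; the field name body and the class name SausagizationDemo are just for the example:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.synonym.SynonymFilter;
    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.CharsRef;
    import org.apache.lucene.util.Version;

    public class SausagizationDemo {
      public static void main(String[] args) throws Exception {
        SynonymMap.Builder b = new SynonymMap.Builder(true);
        b.add(new CharsRef("wi" + SynonymMap.WORD_SEPARATOR + "fi" + SynonymMap.WORD_SEPARATOR + "network"),
              new CharsRef("hotspot"), true);
        final SynonymMap map = b.build();
        Analyzer analyzer = new Analyzer() {
          @Override
          public TokenStream tokenStream(String fieldName, Reader reader) {
            return new SynonymFilter(new WhitespaceTokenizer(Version.LUCENE_36, reader), map, true);
          }
        };
        RAMDirectory dir = new RAMDirectory();
        IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_36, analyzer));
        Document doc = new Document();
        doc.add(new Field("body", "fast wi fi network is down", Field.Store.NO, Field.Index.ANALYZED));
        w.addDocument(doc);
        w.close();

        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("body", "hotspot"));
        pq.add(new Term("body", "is"));
        pq.add(new Term("body", "down"));
        // Expect 0 hits: "hotspot" should reach the position of "is", but the
        // indexer flattened its position length to 1, so the phrase cannot match.
        System.out.println("hits: " + searcher.search(pq, 10).totalHits);
      }
    }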
QueryParser also ignores position length; however, this should be easier to fix. It would mean you could run graph analyzers at query time (i.e., query-time expansion) and get the correct results.
Another problem is that SynonymFilter also unexpectedly performs its own form of sausagization when the injected synonym is more than one token. For example, if you have this rule:

    dns → domain name service

and analyze the text dns is up, notice how name was overlapped onto is, and service was overlapped onto up. It's an odd word salad!
This of course also messes up phrase queries ("domain name service is up" should match but doesn't, while "dns name up" shouldn't match but does). To work around this problem you should ensure all of your injected synonyms are single tokens! For this case, you could run the reverse mapping (domain name service → dns) at query time (as well as indexing time), and then both queries domain name service is up and dns is up will match any document containing either variant.
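A sketch of that reverse, single-token-output mapping, reusing the builder calls from the earlier examples:

    // Reverse mapping: multi-token input, single-token output, so no sausagization.
    SynonymMap.Builder builder = new SynonymMap.Builder(true);
    builder.add(new CharsRef("domain" + SynonymMap.WORD_SEPARATOR + "name" + SynonymMap.WORD_SEPARATOR + "service"),
                new CharsRef("dns"), true);  // true: keep the original tokens too
    SynonymMap map = builder.build();
    // Use this map in the SynonymFilter of both your indexing and query analyzers.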
This happens because SynonymFilter never creates new positions; if it did so, it could make new positions for the tokens in domain name service, and then set dns to position length 3.
Another problem is that SynonymFilter, like the indexer, also ignores the position length of the incoming tokens: it cannot properly consume a token graph. So if you added a second SynonymFilter after the first one, it would fail to match hotspot is down.
We've only just started, but bit by bit our token streams are producing graphs!