Monday, April 30, 2012

Lucene's TokenStreams are actually graphs!

Lucene's TokenStream class produces the sequence of tokens to be indexed for a document's fields. The API is an iterator: you call incrementToken to advance to the next token, and then query specific attributes to obtain the details for that token. For example, CharTermAttribute holds the text of the token; OffsetAttribute has the character start and end offset into the original string corresponding to this token, for highlighting purposes. There are a number of standard token attributes, and some tokenizers add their own attributes.

The TokenStream is actually a chain, starting with a Tokenizer that splits characters into initial tokens, followed by any number of TokenFilters that modify the tokens. You can also use a CharFilter to pre-process the characters before tokenization, for example to strip out HTML markup, remap character sequences or replace characters according to a regular expression, while preserving the proper offsets back into the original input string. Analyzer is the factory class that creates TokenStreams when needed.

Lucene and Solr have a wide variety of Tokenizers and TokenFilters, including support for at least 34 languages.

Let's tokenize a simple example: fast wi fi network is down. Assume we preserve stop words. When viewed as a graph, the tokens look like this:



Each node is a position, and each arc is a token. The TokenStream enumerates a directed acyclic graph, one arc at a time.

Next, let's add SynoynmFilter into our analysis chain, applying these synonyms:
  • fastspeedy
  • wi fiwifi
  • wi fi networkhotspot
resulting in this graph:



Now the graph is more interesting! For each token (arc), the PositionIncrementAttribute tells us how many positions (nodes) ahead this arc starts from, while the new (as of 3.6.0) PositionLengthAttribute tells us how many positions (nodes) ahead the arc arrives to.

Besides SynonymFilter, several other analysis components now produce token graphs. Kuromoji's JapaneseTokenizer outputs the decompounded form for compound tokens. For example, tokens like ショッピングセンター (shopping center) will also have an alternate path with ショッピング (shopping) followed by センター (center). Both ShingleFilter and CommonGramsFilter set the position length to 2 when they merge two input tokens.

Other analysis components should produce a graph but don't yet (patches welcome!): WordDelimiterFilter, DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter, NGramTokenFilter, EdgeNGramTokenFilter, and likely others.

Limitations

There are unfortunately several hard-to-fix problems with token graphs. One problem is that the indexer completely ignores PositionLengthAttribute; it only pays attention to PositionIncrementAttribute. This means the indexer acts as if all arcs always arrive at the very next position, so for the above graph we actually index this:



This means certain phrase queries should match but don't (e.g.: "hotspot is down"), and other phrase queries shouldn't match but do (e.g.: "fast hotspot fi"). Other cases do work correctly (e.g.: "fast hotspot"). We refer to this "lossy serialization" as sausagization, because the incoming graph is unexpectedly turned from a correct word lattice into an incorrect sausage. This limitation is challenging to fix: it requires changing the index format (and Codec APIs) to store an additional int position length per position, and then fixing positional queries to respect this value.

QueryParser also ignores position length, however this should be easier to fix. This would mean you can run graph analyzers at query time (i.e., query time expansion) and get the correct results.

Another problem is that SynonymFilter also unexpectedly performs its own form of sausagization when the injected synonym is more than one token. For example if you have this rule:
  • dnsdomain name service
it results in graphs like this:



Notice how name was overlapped onto is, and service was overlapped onto up. It's an odd word salad!

This of course also messes up phrase queries ("domain name service is up" should match but doesn't, while "dns name up" shouldn't match but does). To work around this problem you should ensure all of your injected synonyms are single tokens! For this case, you could run the reverse mapping (domain name servicedns) at query time (as well as indexing time) and then both queries dns and domain name service will match any document containing either variant.

This happens because SynonymFilter never creates new positions; if it did so, it could make new positions for tokens in domain name service, and then change dns to position length 3.

Another problem is that SynonymFilter, like the indexer, also ignores the position length of the incoming tokens: it cannot properly consume a token graph. So if you added a second SynonymFilter it would fail to match hotspot is down.

We've only just started but bit by bit our token streams are producing graphs!

32 comments:

  1. thank you so much, it is very insightful!

    ReplyDelete
  2. Hi Mike,

    I'm trying to figure out how to use the SynonymFilter and was wondering if you could share the code for your analysis chain. I've tried Googling for help, but haven't had much luck.

    Thank you!

    ReplyDelete
  3. Hi Anonymous,

    Maybe try looking at the unit tests we have for syn filter?

    https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/analysis/common/src/test/org/apache/lucene/analysis/synonym/

    ReplyDelete
  4. Thanks, Mike. I don't know why I didn't think of that. When I Googled for SynonmFilter it turned up mostly links to the code. Maybe I'll have to start a blog and document my learning process. Right now I have a working SynonymFilter but I want to use an external file to define my synonyms, so I'll be playing with SolrSynonymParse next.

    ReplyDelete
  5. Hi rrs,

    Yes please create a blog post describing how to use SynonymFilter, and/or patches to improve the javadocs!

    ReplyDelete
  6. Hi Mike,

    Thanks for the detailed description!
    I think one major problem of Lucene is what you mentioned:
    "One problem is that the indexer completely ignores PositionLengthAttribute"

    Basically, it is impossible to handle multi-word synonyms with phrase queries when the query string contains overlapping terms.
    Do you know or plan any workaround for this?

    Thanks!

    ReplyDelete
  7. Hi Csáb,

    One possible workaround is to call PhraseQuery.setSlop: this allows some "fuzziness" in the exact ordering of the words.

    ReplyDelete
  8. Great post. A couple of questions. Is the position length even stored in the index? Are there plans to add this to Lucene in some future release?

    ReplyDelete
  9. Hi Nicole,

    The position length is not stored in the index ... I don't think there are plans to do so (but patches are always welcome!). Even if we did that, we'd also need to improve queries (or make new queries) that pay attention to that attribute (eg to do PhraseQuery correctly for a graph)...

    ReplyDelete
  10. Great post! Thanks!

    ReplyDelete
  11. Hi Mike
    Do you know of any workaround for that problem?
    I am looking for a way to index linguistic annotations that can span over several tokens (i.e. named entities)
    As I understood the query stack must be rewritten to be able to search.
    Do you know of any implementation sample code or tutorial for lucene 4?
    Thanks in advance
    Regards
    Sebastien

    ReplyDelete
    Replies
    1. Workaround for which problem? The fact that posLength is not added to the index? I don't know of any workaround currently ... you might be able to index into a payload somehow.

      However you can still index and search annotations spanning multiple tokens already today, it's just that the searching of positional queries (e.g. PhraseQuery) won't be 100% correct.

      Delete
  12. Hello Sir,
    I am new to Solr Technology and I want to use SolrTextTagger in my project but was failed to setup it up on my machine by following the instructions given on David Smiley Github Repository.Can you please share some resources related to SolrTextTagger which might be helpful.
    Thanks.

    ReplyDelete
    Replies
    1. Hi Anonymous,

      Can you email solr-user@lucene.apache.org?

      Delete
  13. Does the indexer still ignore PositionLengthAttribute? Is there a ticket tracking that?

    ReplyDelete
    Replies
    1. Hi Matt,

      Yes, it still ignores it. I don't think we have a Jira issue open to add it to the index ... it's not clear there's a solid use-case that requires it ... and adding it would be quite a bit of work. You could use payload today to encode it, just using a custom TokenFilter.

      Delete
    2. Thanks, I will try the payload route. Our use case is synonym expansion of controlled vocabularies with many synonyms of varying lengths. We would like phrase queries to work with all the synonyms.

      i.e "United States voted on NATO resolution" or "US voted on North Atlantic Treaty Organization resolution" returning the same results.

      Delete
  14. In case anyone is interested:
    https://issues.apache.org/jira/browse/LUCENE-4312
    https://issues.apache.org/jira/browse/LUCENE-3843

    ReplyDelete
  15. Hi Sir,

    I'm looking tokenizer / filter the following format :
    eg: "fast wi fi network is down". If using solr.StandardTokenizerFactory , I have the "Position " corresponding to displayed : fast ( 1 ) - > wi ( 2 ) - > fi ( 3 ) - > Network ( 4 ) - > is ( 5 ) - - > down ( 6 ) . But I need you just create a new custom or class to the question above is "fast wi fi network is down" but the analysis is currently Position as follows : fast ( 1 ) - > fi ( 2 ) - > is ( 3 ) or wi ( 1 ) - > network ( 2 ) - > down ( 3 ) . I know it involves startOffset , endOffset ... but I can not figure out how to solve .
    Thanks in advance!

    ReplyDelete
    Replies
    1. Hi, can you ask your question on the Lucene user's list? (java-user@lucene.apache.org)

      I can't tell whether you have a matching issue (PhraseQuery matching when it should not, or vice-versa) or a highlighting issue.

      Delete
  16. Hi Mike,

    I'm trying to find topics about Custum PositionFilterFactory in solr 5.2.1 version.
    Do you know for this?

    ReplyDelete
  17. Hello sir,

    I'm trying to find how to get each Odd token output
    eg: input: a, b, c, d, e, f
    and output (get tokens): a, c, e
    Do you know of any resolve for that problem?
    Thank in advance!

    ReplyDelete
    Replies
    1. This is an easy custom TokenFilter; it's incrementToken would just increment the incoming Tokenizer, discard the first one, and keep the second one (return to its caller).

      Delete
    2. Yes, I know but how specifically? Would you mind showing?
      code:
      @Override
      public final boolean incrementToken() throws IOException {
      if(!input.incrementToken()){
      return false;
      }else{

      // ?? }
      return true;

      }

      Delete
    3. Well, inside the else, you should simply do "return input.incrementToken();". This way the results of the first input.incrementToken() are effectively discarded. This returns each odd token...

      Delete
    4. Thank you for answers, I'm with you.
      And. I am new to the techniques Solr and can be difficult to solve a few problems. Would you mind having your email, please?

      Delete
    5. Instead of my personal email, can you please direct questions to Lucene's user's list (java-user@lucene.apache.org)?

      Delete
  18. I ask this for Edwin, he experienced a strange problem on highlighting, can you analyze the possible cause of his problem. This is the original post from hm: https://github.com/huaban/jieba-analysis/issues/16

    ReplyDelete
  19. Hello Sir,
    I am using a decompounding in my application. My problem is the results i get even if any one of the term is present for ex: rotwein , It gets tokenized into "rotwein","rot","wein".This is parsed at query time to( rotwein or rot or wein) . But i want this to be only (rot and wein). How can i achieve this. Any pointers would be of great help.

    ReplyDelete
    Replies
    1. You need to somehow configure or fix your decompounder not to emit the "rotwein" term.

      Delete