Wednesday, August 14, 2013

SuggestStopFilter carefully removes stop words for suggesters

Lucene now has a nice set of suggesters that use an analyzer to tokenize the suggestions: AnalyzingSuggester, FuzzySuggester and AnalyzingInfixSuggester. Using an analyzer is powerful because it lets you customize exactly how suggestions are matched: you can normalize case, apply stemming, match across different synonym forms, etc.

One of the most common things you'll do with your analyzer is to remove stop words using StopFilter. Unfortunately, if you try this, you'll quickly notice that the stop filter is too aggressive because it happily removes the last token even if the user isn't done typing it yet. For example, if the user has typed "a", you'd expect suggestions like apple, aardvark, etc., but you won't get them because StopFilter removed the "a" token.

You could try using StopFilter only while indexing, which was my first attempt with the suggestions at jirasearch.mikemccandless.com, but then, at least for AnalyzingInfixSuggester, you'll fail to get matches when you pass allTermsRequired=true: the suggester then requires that every query token, including the stop words that were stripped from the index, find a match.

Finally, you could use the new SuggestStopFilter at lookup time: this filter is just like StopFilter except for the very last token. For that token it checks the end offset, and if the offset shows that the user has typed nothing (not even a trailing space) after the token, the token is preserved on the assumption that the user may still be typing it. The preserved token is also marked as a keyword, so that any later stem filters won't change it. This way the query "a" can find "apple", but the query "a " (with a trailing space) will find nothing because the "a" is removed.
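
To make the rule concrete, here is a tiny Python sketch of the same decision logic. It is only an illustration of the behavior described above, not the actual Lucene Java TokenFilter; the stop list, function name and tuple layout are invented for the example:

    # Illustrative sketch of SuggestStopFilter's rule (the real class is a Java
    # TokenFilter); the stop list and tuple layout here are made up for the example.
    STOP_WORDS = set(["a", "an", "and", "the", "of"])

    def filter_suggest_tokens(query, tokens):
        # tokens: list of (term, start_offset, end_offset) from the tokenizer
        kept = []
        last = len(tokens) - 1
        for i, (term, start, end) in enumerate(tokens):
            if term not in STOP_WORDS:
                kept.append((term, False))            # ordinary token, keyword=False
            elif i == last and end == len(query):
                # Last token, and nothing (not even a space) was typed after it:
                # the user may still be typing the word, so keep it and mark it
                # as a keyword so later stem filters leave it alone.
                kept.append((term, True))
            # otherwise: a completed stop word, so drop it
        return kept

    print(filter_suggest_tokens("a",  [("a", 0, 1)]))   # [('a', True)]  -> can still match "apple"
    print(filter_suggest_tokens("a ", [("a", 0, 1)]))   # []             -> trailing space, dropped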

I've pushed SuggestStopFilter to jirasearch.mikemccandless.com and it seems to be working well so far!

Friday, August 2, 2013

A new version of the Compact Language Detector

It's been almost two years since I originally factored out the fast and accurate Compact Language Detector from the Chromium project, and the effort was clearly worthwhile: the project is popular and others have created additional bindings for languages including at least Perl, Ruby, R, JavaScript, PHP and C#/.NET.

Eric Fischer used CLD to create the colorful Twitter language map, and since then further language maps have appeared, e.g. for New York and London. What a multi-lingual world we live in!

Then, just a few weeks ago, I received an out-of-the-blue email from Dick Sites, the creator of CLD, with great news: he was finishing up version 2.0 of CLD and had already posted the source code in a new project.

So I've now reworked the Python bindings and ported the unit tests to Python (they pass!) to take advantage of the new features. It was much easier this time around since the CLD2 sources were already pulled out into their own project (thank you Dick and Google!).

There are a number of improvements over the previous version of CLD:
  • Improved accuracy.

  • Upgraded to Unicode 6.2 characters.

  • More languages detected: 83 languages, up from 64 previously.

  • A new "full language table" detector, available in Python as a separate cld2full module, that detects 161 languages. This increases the C library size from 1.8 MB (for 83 languages) to 5.5 MB (for 161 languages). Details are here.

  • An option to identify which parts (byte ranges) of the text contain which language, in case the application needs to do further language-specific processing. From Python, pass the optional returnVectors=True argument to get the byte ranges (see the sketch just after this list), but note that this incurs a non-trivial additional CPU cost. This wiki page shows very interesting statistics on how frequently different languages appear on a single page across top web sites, showing the importance of handling multiple languages in a single text input.

  • A new hintLanguageHTTPHeaders parameter, which you can pass from the Content-Language HTTP header. Also, CLD2 will spot any lang=X attribute inside the <html> tag itself (if you pass it HTML).
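
To make the last two items concrete, here is a small sketch of asking for the byte-range vectors and passing the Content-Language hint from Python. The (offset, length, language name, language code) layout of each vector entry is my assumption; only the returnVectors and hintLanguageHTTPHeaders parameters themselves come from the description above.

    # Sketch only: the per-range tuple layout (offset, length, name, code) is an
    # assumption, not something documented in this post.
    import cld2

    text = u"France is the largest country in Western Europe. Je ne parle pas anglais."

    isReliable, bytesFound, details, vectors = cld2.detect(
        text.encode("utf-8"),
        returnVectors=True,                   # also return per-byte-range language info
        hintLanguageHTTPHeaders="fr,en")      # value copied from the Content-Language header

    for offset, length, langName, langCode in vectors:
        print("bytes %d-%d: %s (%s)" % (offset, offset + length, langName, langCode))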

In the new Python bindings, I've exposed CLD2's debug* flags, which add verbosity to CLD2's detection process. This document describes how to interpret the resulting output.

The detect function returns up to three top detected languages; each includes the percentage of the text detected as that language and a confidence score. The function no longer returns a single "picked" summary language, and the pickSummaryLanguage option has been removed: it was apparently present only for internal backwards-compatibility reasons and did not improve accuracy.
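
For example, a basic call from Python looks roughly like this; the field order inside each details entry (language name, language code, percent, score) is my assumption based on the description above:

    # Sketch of a basic detection call; the field order inside each `details`
    # entry (name, code, percent, score) is assumed, not documented here.
    import cld2

    text = u"Ceci n'est pas une pipe."
    # CLD2 expects valid UTF-8 bytes, so encode explicitly:
    isReliable, bytesFound, details = cld2.detect(text.encode("utf-8"))

    print("reliable: %s" % isReliable)
    for langName, langCode, percent, score in details:
        print("%s (%s): %d%% of the text, score %s" % (langName, langCode, percent, score))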

Remember that the provided input must be valid UTF-8 bytes; otherwise all sorts of things can go wrong (wrong results, even a segmentation fault).

To see the list of detected languages, just run python -c "import cld2; print cld2.DETECTED_LANGUAGES", or python -c "import cld2full; print cld2full.DETECTED_LANGUAGES" to see the full set of languages.

The README gives details on how to build and install CLD2.

Once again, thank you Google, and thank you Dick Sites for making this very useful library available to the world as open-source.