Google's Chrome browser has a useful translate feature: it detects the language of the page you're visiting and, if that differs from your local language, offers to translate it.
Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.
It turns out the CLD part of the Chromium source tree is nicely standalone, so I pulled it out into a new separate Google code project, making it possible to use CLD directly from any C++ code.
I also added a basic initial Python binding (one method!), and ported the small C++ unit test (verifying detection of known strings for 64 different languages) to Python (it passes!).
So detecting language is now very simple from Python:
    import cld
    topLanguageName = cld.detect(bytes)[0]

The detect method returns a tuple, including the language name and code (such as 'ENGLISH' and 'en'), an isReliable boolean (True if CLD is quite sure of itself), the number of actual text bytes processed, and then details for each of the top languages (up to 3) that were identified.
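For example, here's a sketch of unpacking the full return value; the field order below is my assumption based on the description above, so check the README if it doesn't match:

    import cld

    utf8Bytes = 'Hello my friend, how are you doing today?'

    # Field order here is assumed from the description above, not a confirmed spec:
    name, code, isReliable, textBytesFound, details = cld.detect(utf8Bytes)

    print(name)            # e.g. 'ENGLISH'
    print(code)            # e.g. 'en'
    print(isReliable)      # True if CLD is quite sure of itself
    print(textBytesFound)  # number of actual text bytes processed
    print(details)         # entries for each of the top (up to 3) languages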
You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out beforehand.
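If your content arrives in some other charset, a minimal sketch of normalizing it first might look like this (the 'latin-1' charset and the page.html input file are just assumed examples):

    import cld

    rawBytes = open('page.html', 'rb').read()

    # Decode with the known/declared charset, replacing anything invalid,
    # then re-encode so CLD only ever sees interchange-valid UTF-8:
    utf8Bytes = rawBytes.decode('latin-1', 'replace').encode('utf-8')

    result = cld.detect(utf8Bytes)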
You can also optionally provide hints to the detect method, including the declared encoding and language (for example, from an HTTP header or an embedded META http-equiv tag in the HTML), as well as the domain name suffix (so the top level domain suffix es would boost the chances for detecting Spanish); there's a sketch below. CLD uses these hints to boost the priors for certain languages. There is this fun comment in the code in front of the tables holding the per-language prior boosts:

    Generated by dsites 2008.07.07 from 10% of Base

How I wish I too could build tables off of 10% of Base!
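Here's a hedged sketch of passing those hints; the keyword argument names below are my assumptions, not the binding's confirmed API (the README has the real signature), but they map one-to-one onto the hints just described:

    import cld

    utf8Bytes = 'Hola mundo'  # clean UTF-8 input, as before

    # Keyword names here are assumed for illustration, not confirmed:
    result = cld.detect(utf8Bytes,
                        hintEncoding='ISO-8859-1',  # declared charset, e.g. from an HTTP header
                        hintLanguageCode='es',      # declared language, e.g. from a META tag
                        hintTopLevelDomain='es')    # domain suffix; boosts the prior for Spanish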
The code itself looks very cool and I suspect (but haven't formally verified!) it's quite accurate. I only understand bits and pieces about how it works; you can read some details here and here.
It's also not clear just how many languages it can detect; I see there are 161 "base" languages plus 44 "extended" languages, but then I see many test cases (102 out of 166!) commented out. This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.
This port is all still very new, and I extracted CLD quickly, so likely there are some problems still to work out, but the fact that it passes the Python unit test is encouraging. The README.txt has some more details.
Thank you Google!