Changing Bits: Language detection with Google's Compact Language Detector

Friday, October 21, 2011

Google's Chrome browser has a useful translate feature, where it detects the language of the page you've visited and if it differs from your local language, it offers to translate it.

Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.

It turns out the CLD part of the Chromium source tree is nicely standalone, so I pulled it out into a separate Google Code project, making it possible to use CLD directly from any C++ code.

I also added a basic initial Python binding (one method!) and ported the small C++ unit test (verifying detection of known strings for 64 different languages) to Python (it passes!).

So detecting language is now very simple from Python:

    import cld
    topLanguageName = cld.detect(bytes)[0]

The detect method returns a tuple, including the language name and code (such as RUSSIAN, ru), an isReliable boolean (True if CLD is quite sure of itself), the number of actual text bytes processed, and then details for each of the top languages (up to 3) that were identified.
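As a rough illustration of that tuple's shape, the sketch below unpacks a result into named fields. It does not call CLD; the sample value is one a reader reports in the comments, and the names `Detection`, `LanguageDetail`, `percent` and `normalized_score` are labels chosen here for readability, not names exposed by the binding.

```python
from collections import namedtuple

# Field names are illustrative assumptions; the binding returns plain tuples.
LanguageDetail = namedtuple("LanguageDetail", "name code percent normalized_score")
Detection = namedtuple("Detection", "name code is_reliable text_bytes details")


def unpack_detection(result):
    """Turn a raw detect()-style tuple into named fields."""
    name, code, is_reliable, text_bytes, details = result
    return Detection(name, code, is_reliable, text_bytes,
                     [LanguageDetail(*d) for d in details])


# Sample value in the shape described above (not produced by this snippet).
sample = ("ENGLISH", "en", True, 264,
          [("ENGLISH", "en", 78, 158.02269043760128),
           ("IRISH", "ga", 22, 53.226879574184963)])

detection = unpack_detection(sample)
print(detection.name, detection.code, detection.is_reliable)
for detail in detection.details:
    print(detail.code, detail.percent)
```

Named fields make it harder to mix up the per-language percent with the score that follows it.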

You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out beforehand.
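One possible way to get well-formed UTF-8 out of bytes in a known encoding is to decode and re-encode with the standard library, as in this minimal sketch. The helper name `to_clean_utf8` is hypothetical, and this only guarantees well-formed UTF-8; CLD's notion of "interchange-valid" may be stricter (for example about control characters), which is not checked here.

```python
def to_clean_utf8(raw, source_encoding="utf-8"):
    """Decode raw bytes with the given encoding, then re-encode as UTF-8.

    Undecodable byte sequences become U+FFFD instead of raising an error.
    """
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


# Latin-1 bytes for "caf\u00e9" are not valid UTF-8 until they are transcoded.
latin1_bytes = "caf\u00e9".encode("latin-1")
utf8_bytes = to_clean_utf8(latin1_bytes, "latin-1")
print(utf8_bytes)
```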

You can also optionally provide hints to the detect method, including the declared encoding and language (for example, from an HTTP header or an embedded META http-equiv tag in the HTML), as well as the domain name suffix (so the top level domain suffix es would boost the chances for detecting Spanish). CLD uses these hints to boost the priors for certain languages. There is this fun comment in the code in front of the tables holding the per-language prior boosts:

    Generated by dsites 2008.07.07 from 10% of Base

How I wish I too could build tables off of 10% of Base!
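For the domain-suffix hint mentioned above, the suffix has to come from somewhere; a small sketch of pulling it out of a page URL with the standard library follows. The helper name `tld_hint` is hypothetical, and how the value is then passed to the detect method is left out because the post does not name that argument.

```python
from urllib.parse import urlsplit


def tld_hint(url):
    """Return the last label of the URL's host name, or None if there is none."""
    host = urlsplit(url).hostname  # already lower-cased by urlsplit
    if not host or "." not in host:
        return None
    return host.rstrip(".").rsplit(".", 1)[-1]


print(tld_hint("http://www.example.es/noticias"))
```

This sketch does not special-case IP addresses, so a numeric host would yield a meaningless "suffix"; a real caller would probably want to filter those out.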

The code itself looks very cool and I suspect (but haven't formally verified!) it's quite accurate. I only understand bits and pieces about how it works; you can read some details here and here.

It's also not clear just how many languages it can detect; I see there are 161 "base" languages plus 44 "extended" languages, but then I see many test cases (102 out of 166!) commented out. This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.

This port is all still very new, and I extracted CLD quickly, so likely there are some problems still to work out, but the fact that it passes the Python unit test is encouraging. The README.txt has some more details.

Thank you Google!

78 comments:

capnfabs, October 22, 2011 at 7:17 AM
Nice one! Great find!

Dhruv Matani, October 22, 2011 at 8:30 AM
Can be handy!

Benoit, October 22, 2011 at 1:53 PM
Thanks a lot! Will be very useful.

Max Bolingbroke, October 22, 2011 at 3:04 PM
You can use Mozilla's libcharsetdetect to guess the encoding for the UTF8 conversion. I've packaged a standalone version of the code here: https://github.com/batterseapower/libcharsetdetect

Lars Strojny, October 22, 2011 at 6:32 PM
Thank you very much. I've taken your Python bindings and provided some for PHP: https://github.com/lstrojny/php-ccld

Karl, October 22, 2011 at 8:11 PM
Very cool Mike! Did you do any tests on how it does with large vs small amounts of text?

Michael McCandless, October 22, 2011 at 8:32 PM
Max, libcharsetdetect sounds great!

Michael McCandless, October 22, 2011 at 8:32 PM
Lars, awesome that you built PHP bindings! Thanks.

Michael McCandless, October 22, 2011 at 8:34 PM
Karl, I haven't done any real testing yet, but I am trying to compare it to the Java language detection package: http://code.google.com/p/language-detection

Lars Strojny, October 23, 2011 at 4:01 PM
@Mike: regarding the missing encoding implementation in the Python binding: I provide a class Encoding with all the defined encoding integers as class constants. Check regenerate-encoding-table.sh and cld_encodings.h, maybe that’s something for you too.

Michael McCandless, October 24, 2011 at 6:30 AM
Lars, thank you! I've poached that back, exposed it as cld.ENCODINGS, and now support the encoding hint!

Lars Strojny, October 24, 2011 at 3:33 PM
@Mike: cool, that allows me to remove a bunch of ugliness ;)

Eric Fischer, October 24, 2011 at 8:39 PM
Thanks for putting this together. It did a nice job of identifying the languages used on Twitter.

Anonymous, October 26, 2011 at 2:15 AM
Very cool. I'm trying to build the Python bindings, but it fails because cld_encodings.h is missing. This include appears in revision 1c4ed384ca54 (I also filed an issue on Google Code).

Michael McCandless, October 26, 2011 at 7:03 AM
Woops, sorry about that aitzol -- I forgot to "hg add" the file. I just fixed it...

Marco Lui, October 28, 2011 at 1:30 AM
I was not aware that Chrome had a language identifier built into it, I had always assumed that it was done via queries to Google's Language Identification AJAX API. I have been developing my own approach to open-web language identification, it is available at https://github.com/saffsd/langid.py , and is based on my research that will be presented at IJCNLP2011 (http://www.ijcnlp2011.org/). I will compare my system to CLD when I can find some time to do so!

Anonymous, October 28, 2011 at 2:44 AM
Has anyone done comparisons with http://odur.let.rug.nl/~vannoord/TextCat/ ? I would assume the major strength of CLD is in how huge their corpora are …

Michael McCandless, October 28, 2011 at 8:57 AM
Hi Marco,

I ran langid.py on 18 of the Europarl languages and it performs very well! 99.20% (17856 / 18000) vs the best (Java language-detection Google code) at 99.26% (17866 / 18000). Impressive! Especially considering how small the overall model is (and I love how it's just packed into a big Python string!).

da (95.4%) and sl (97.3%) are the two most challenging languages.

Also, this brings the "majority rules" (across all 4 detectors) accuracy up to 99.73% (17952 / 18000), which is awesome (it means langid.py is pulling from "signals" that are relatively independent of the others).

Cool!
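The "majority rules" combination mentioned above can be sketched roughly as follows. This is only an illustration: the function name `majority_vote` is hypothetical, and returning None on a tie for first place is an assumption, since the comment does not say how ties were handled in the benchmark.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common language code, or None on a tie for first place."""
    counts = Counter(predictions).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no single winner; the caller decides what to do
    return counts[0][0]


# One prediction per detector for the same document (made-up values).
votes = ["da", "da", "no", "da"]
print(majority_vote(votes))
```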

Michael McCandless, October 29, 2011 at 6:56 AM
Anonymous, I haven't tried TextCat but it looks like a compelling option too!

Michael McCandless, November 9, 2011 at 7:25 PM
Hi Janis,

Not sure why your comment isn't shown here -- I'm copy/pasting it here:

Hi! I have problems with building and installing this module on Windows7 (needed to install gcc with mingw but when this was done, there were still many errors while compiling) and Linux Ubuntu (full console with many errors like "./ceval.h:125: error: expected constructor, destructor, or type conversion before ‘(’ token") - any suggestions what else I need to build your code?

I was able to build on Windows, using the checked in build.win.cmd (I'm using Visual Studio 8), but I don't have the older Visual Studio installed to compile the Python bindings.

On Linux (Fedora 13) I compiled fine with build.sh, using gcc 4.4.4, and then built the Python bindings using setup.py.

Sascha, November 14, 2011 at 11:38 AM
Hi,
I'm currently trying to follow the steps to build the Python bindings for CLD under Windows 7. I (successfully) built the library using 'build.win.cmd', but when I try to run the Python setup, I run into an error I don't know what to do with:

C:\Users\sascha\eclipse\workspaces\pydev\chromium-compact-language-detector>python -u setup.py build
running build
running build_ext
building 'cld' extension
C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -I. -IC:\Python26\include -IC:\Python26\PC /Tppycldmodule.cc /Fobuild\temp.win32-2.6\Release\pycldmodule.obj
pycldmodule.cc
c:\users\sascha\eclipse\workspaces\pydev\chromium-compact-language-detector\base/string_util.h(23) : error C2039: 'strcasecmp' : is not a member of '`global namespace''
error: command '"C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.exe"' failed with exit status 2

Any suggestions would be greatly appreciated!
Thanks, Sascha

Michael McCandless, November 15, 2011 at 6:27 AM
Hi Sascha,

I just pushed a fix for that compilation error -- can you "hg pull -u" and see if it works?

This was due to a missing compile time define (/DWIN32) in setup.py.

Anonymous, November 16, 2011 at 11:35 AM
thanks for this handy tool!

Sascha, November 17, 2011 at 5:35 AM
Hi Mike,

thanks a lot for the fix, worked great now! Thank you for your help!

A note to other windows users: For some reason, something seemed to have gone wrong in renaming a file during the build. So after running

'build.win.cmd'

I had to run

'ren libcld.lib cld.lib'

before then running

'python -u setup.py build' and afterwards
'python -u setup.py install'.

Marco Lui, November 30, 2011 at 12:49 AM
Hello Mike,

Glad to hear that langid.py is working well for you. I'm continuing to develop it, trying to get it to work well with even shorter strings in even more languages.

@Anonymous RE: TextCat, I compared my tool langid.py extensively to TextCat in my recently published paper; a copy is available at http://www.ijcnlp2011.org/proceeding/IJCNLP2011-MAIN/pdf/IJCNLP-2011062.pdf . Our findings were quite straightforward: the performance of TextCat really starts to fall off as more languages and more domains are considered.

uMadd?, January 5, 2012 at 9:12 AM
Hi Mike,

Thank you for providing a Python wrapper for the CLD. I compiled the latest version with MinGW and it works. One thing however: You write in the README that you made no changes to the original Chromium source but at least encodings/compact_lang_det/compact_lang_det.h differs.

Thank you again,

/David

Michael McCandless, January 5, 2012 at 9:55 AM
Aha! You are right David; I actually did modify compact_lang_det.{h,cc}, and compact_lang_det_impl.{h,cc}. These files provide the "entry points" to the core CLD library... and my changes were minor: I removed a few entry points that were likely backwards-compatible layers for within Chrome (not important here), and I also opened up control over previously hardwired functionality for removing weak matches and picking the summary language.

Also, cld_encodings.h is new (I copied this from the PHP port), and it just provides mappings from the encoding constants to their string names...

Hywel, February 4, 2012 at 7:12 PM
Thanks for your work. I can't fully reconcile the tuple returned with your description. For example, testing a string of 261 characters (with 3 languages) I get:
('ENGLISH', 'en', True, 264, [('ENGLISH', 'en', 78, 158.02269043760128), ('IRISH', 'ga', 22, 53.226879574184963)]).

I guess that 78 is the length detected as English, 22 detected as Irish, but what is the 158.022... and 53.226...? Probabilities expressed as parts of 1000?

Michael McCandless, February 5, 2012 at 8:16 AM
Actually, 78 and 22 are the "percent likelihood" for each match, and then the number after that is called "normalized_score" in the code. I'm really not sure exactly how to interpret these numbers, except to say that higher numbers mean stronger matches...

The net number of bytes matched is returned at the top (ie, not per language that matched); in your case it's 264.

Alex Ott, March 28, 2012 at 4:27 AM
Hi Mike

I have several questions:
- Do you plan to make an "official" release for your project?
- How often do you synchronize the source with Chromium?

thank you

Michael McCandless, March 28, 2012 at 6:59 AM
Hi Alex,

Unfortunately I don't sync up w/ CLD at all (or at least, I haven't yet). And I wasn't planning on doing official releases...

Patches are welcome!!

Thanks,

Mike

Unknown, May 9, 2012 at 8:23 AM
Mike,

Nice work!

CLD is great for detecting the language for a given buffer. However, I need to extract the word boundaries from the buffer as well as detect the language for each of these words.

I'm developing a search engine and I need to pass each word to the appropriate language specific stemmer.

Michael McCandless, May 12, 2012 at 5:29 PM
Tokenizing mixed language content is definitely a challenge ... you could try Lucene's StandardAnalyzer? It tokenizes according to the Unicode UAX #29 text segmentation standard.

Maybe also email the Lucene users list (java-user@lucene.apache.org)?

Minseok, May 13, 2012 at 1:30 AM
Hi Mike,

This post is awesome for me.

I have several questions:

CLD is a very useful, excellent framework.

But I have googled all about CLD and did not find any hint about using it on iOS (Objective-C).

Could you please give me a hint on how to build the CLD library for the iPhone?

Using the gcc compiler for armv7 (iPhone device) and i386 (iPhone simulator), it seems it could be converted for use, maybe by modifying the build.sh script file, but I do not know what to do.

Besides the above, it seems to have been ported to Java, Python, Ruby, C#, node.js etc., but sadly not Objective-C.

How do I port or build the CLD library for iOS development?

Finally, iOS only allows static libraries; dynamic libraries are not allowed.

My PC OS is macOSX10.6.7 Lion, Xcode 4.3.2. Thanks.

Michael McCandless, May 14, 2012 at 8:32 AM
Hi Minseok,

Unfortunately I don't know much about iOS development. It could be that langid.py https://github.com/saffsd/langid.py is a better fit?

Liolik, June 19, 2012 at 11:34 AM
Hi Mike,

It's great to find your tool online!
I was trying to run it to test, however, in the new package I am missing the 'bindings' dir.. I see it in the sources, though.. How do I get the bindings dir without actually checking out the source code?

Thanks a lot!

Michael McCandless, June 19, 2012 at 11:58 AM
Liolik,

Hmmm: can you open an issue at http://code.google.com/p/chromium-compact-language-detector/issues/list ? Thanks.

We recently reworked the packaging so something could easily be wrong ... are you using the C or Python APIs? And, which release package?

Anonymous, June 20, 2012 at 7:18 AM
Hi Mike,
Very nice work.

I've just downloaded and compiled/installed from the .tar.gz source file and the 'bindings' dir is actually missing (as mentioned by Liolik).
http://code.google.com/p/chromium-compact-language-detector/downloads/detail?name=compact-language-detector-0.1.tar.gz&can=2&q=

Ray

Michael McCandless, June 20, 2012 at 8:16 AM
Hi Liolik, Anonymous: I put a comment on the issue ...

Anonymous, June 20, 2012 at 8:46 AM
thanks Mike, I'll give it a try.
Ray

manoj1919, August 7, 2012 at 5:10 AM
Hi Mike,
I am new to Python. How do I install the cld module into my Python? I don't know how to install new modules in Python. I tried looking at many blogs; most say to use the setup.py file with the command: python setup.py install . But I don't find such a thing in the cld files. Please help me, I need language detection even if it's not the Pythonic way, so suggestions other than Python are also more than welcome. Thanks a lot
Manoj

Michael McCandless, August 7, 2012 at 6:44 AM
Hi manoj1919,

First you need to build & install CLD from sources (see the downloads on the google code site).

Then, download the Python wrapper from pypi: http://pypi.python.org/pypi/chromium_compact_language_detector/0.1.1
and run setup.py from there.

Jérôme Richalot, August 13, 2012 at 3:52 PM
Mike,

Have installed module and bindings apparently correctly. I have a
chromium_compact_language_detector-0.1.1-py2.7.egg-info and cld.so
in /usr/lib/python2.7/site-packages which I take as rather positive signs!

However, when in the python interpreter upon trying to import cld I get

>>> import cld
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: libcld.so.0: cannot open shared object file: No such file or directory

Any idea what I am doing wrong?

Many thanks for making this available. I hope I can use it.

Regards

Michael McCandless, August 13, 2012 at 4:11 PM
Hi Jérôme,

Hmm that's odd. Are you sure you're running Python2.7 when you run "import cld"?

If you import sys and print sys.path does it have /usr/lib/python2.7/site-packages in the list?

Michael McCandless, August 13, 2012 at 4:33 PM
Hmm does your LD_LIBRARY_PATH point to the directory (maybe /usr/local/lib?) where libcld.so is installed?
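A quick way to check that from Python is sketched below. The helper name `find_in_library_path` is hypothetical, and it only looks in the directories listed in LD_LIBRARY_PATH plus any extra ones you pass in; the real dynamic loader also consults its own cache and default directories, which this sketch ignores.

```python
import os


def find_in_library_path(filename, extra_dirs=()):
    """Return the first listed directory that contains filename, or None."""
    env_dirs = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    for directory in list(env_dirs) + list(extra_dirs):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            return directory
    return None


# The file name is the one from the ImportError reported in the thread.
print(find_in_library_path("libcld.so.0", extra_dirs=["/usr/local/lib"]))
```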

Jérôme Richalot, August 14, 2012 at 3:50 AM
1. Hmm indeed, I do not seem to have an LD_LIBRARY_PATH

[root@myhost ~]# echo $LD_LIBRARY_PATH

[root@myhost ~]#

2. libcld.so is indeed in /usr/local/lib

Must investigate... TBC

Simon, November 20, 2012 at 7:52 AM
I had trouble getting the official project installed on Windows, so was pointed to this location for an easy_install: http://www.lfd.uci.edu/~gohlke/pythonlibs/#cld

However, I am getting encoding issues with the python bindings.

clean_text = 'a tweet from twitter'
clean_text_utf = clean_text.encode('utf-8', 'ignore')
cld.detect(clean_text_utf , pickSummaryLanguage=True, removeWeakMatches=True)

After a while, I get 'UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4'

Any ideas on how to resolve?

Michael McCandless, November 20, 2012 at 8:33 AM
Hi Simon,

Can you post a small test case showing that exception? Is it coming from within CLD?

Simon, November 21, 2012 at 5:45 AM
Working on the issue on SE here:
http://stackoverflow.com/questions/13473861/encoding-issues-with-cld
Likely not related to CLD, but my rookie python skills in getting the required string across to CLD?

One thing you might be able to help me with (BTW, your name will appear in the credits on final map I am making) is if my CLD parameters are accurate. At the moment, I am seeing some strings being misinterpreted.

some_text = "ha I dont know When can I know your news"
cld.detect(some_text, pickSummaryLanguage=True, removeWeakMatches=True)
This returns:
('FRENCH', 'fr', True, 43, [('ENGLISH', 'en', 62, 26.74230145867099), ('FRENCH', 'fr', 38, 8.221993833504625)])

I am grabbing the first value as the predicted language, but it is often wrong.
What is the significance of the 26 and 8 values? Are these confidence scores?

My cut down code: http://pastebin.com/3DU2BYp0

Michael McCandless, November 21, 2012 at 8:06 AM
Hi Simon,

I put an answer on the stackoverflow question.

That first value is in fact the predicted language, and it's clearly wrong in your example! Urgh. I confirmed I get the same results on Linux ...

Maybe try passing pickSummaryLanguage=False? There is some "smarts" in CLD that sometimes picks a weaker matching language as the choice, as happened in this example. When I pass pickSummaryLanguage=False, it gets the correct answer for your example ... and I think when I ran my benchmark I passed False.

The 26/8 are what CLD calls the "normalized score"; I'm not sure how it's computed ...

Simon, November 21, 2012 at 5:39 PM
Mike - really appreciate the quick response.
Glad to know that you're getting the same results at your end. I have changed to False.

From looking at my result above, I was confused that the first result is French, but the 'normalised score' for English was higher than the score for French?

Michael McCandless, November 22, 2012 at 9:55 AM
I also find that confusing ... somehow whatever "smarts" is implemented in the pickSummaryLanguage=True is able to take a worse-scoring language and pick it ... I'm not sure why :) And I think in my original tests I saw worse accuracy if I passed True.

Simon, November 23, 2012 at 7:07 PM
Thanks Mike. I'm getting better results than before with the psl=t, but I'm thinking that I should perhaps add logic to manually loop through the list of results, and pick the one with the highest score, as opposed to relying on the first result returned. Where do you think would be the best place to query this further?

Michael McCandless, November 24, 2012 at 6:43 AM
I believe with pSL=False that you'll always get the top scoring language as the choice. Have you ever found a case where you didn't?
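For anyone who wants to double-check this, the manual selection Simon describes could look roughly like the sketch below. It works on a result tuple that has already been returned (the sample is the one posted earlier in this thread), and the helper name `top_scoring_language` is hypothetical. It assumes the last number in each per-language entry is the score to compare.

```python
def top_scoring_language(result):
    """Return (name, code) of the detail entry with the highest score, or None."""
    details = result[4]  # list of (name, code, percent, score) entries
    if not details:
        return None
    best = max(details, key=lambda entry: entry[3])
    return best[0], best[1]


# Result posted earlier in the thread for pickSummaryLanguage=True.
sample = ("FRENCH", "fr", True, 43,
          [("ENGLISH", "en", 62, 26.74230145867099),
           ("FRENCH", "fr", 38, 8.221993833504625)])

print(top_scoring_language(sample))
```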

Simon, November 27, 2012 at 5:56 AM
You're right, sticking with False. One thing I am pondering is why I'm not getting a lot of Greek results coming through. If I go to Google Translate, I can type a basic English sentence, grab the Greek, and paste it into CLD, and it returns the wrong language pretty much every time. Any ideas here?

>>> cld.detect("Σήμερα ο καιρός είναι ζεστός", pickSummaryLanguage=False, removeWeakMatches=True)
('RUSSIAN', 'ru', True, 22, [('RUSSIAN', 'ru', 31, 1.303780964797914)])

Michael McCandless, November 27, 2012 at 7:18 AM
Simon, you need to first encode that Greek string as UTF-8, e.g.:

>>> import codecs
>>> cld.detect(codecs.getencoder('UTF-8')(u'Σήμερα ο καιρός είναι ζεστός')[0])

('GREEK', 'el', True, 54, [('GREEK', 'el', 100, 54.0)])

Anonymous, March 20, 2013 at 4:46 PM
Hi,
Can someone please give me the steps to install CLD under Windows 7?
Thanks

Michael McCandless, March 21, 2013 at 2:32 PM
Hi Anonymous,

At one point the build.win.cmd worked on windows, for just the library (not the Python wrapper), but I haven't tested this in some time ...

Anonymous, April 5, 2013 at 5:45 PM
I want to know which of these two libraries (langid and CLD) is better for classifying tweets.

Michael McCandless, April 6, 2013 at 1:39 PM
Hi Anonymous,

I don't know off hand which library is better ... you need to test for yourself I think.

Sylvain Zimmer, April 7, 2013 at 9:22 AM
Mike,

Just wanted to say thanks for the awesome module. I was using guess-language from pip before but CLD gave me a huge boost in result accuracy!

I can also confirm that the 0.2 version didn't work on heroku but the 0.031415 does!! So thanks again for that :)

Cheers,

Michael McCandless, April 7, 2013 at 3:56 PM
Hi Sylvain,

I'm glad you found CLD useful. Say thank you to Google :)

That's spooky that 0.2 does NOT work but 0.031415 (the older version) does. What went wrong with 0.2?

Unknown, April 21, 2013 at 7:36 AM
I did a comparison of CLD with our own language detection API web service (WhatLanguage.net). You can read the full comparison of WhatLanguage.net, CLD, Tika, language-detection and langid.py at http://www.whatlanguage.net/en/api/accuracy_language_detection

Anonymous, May 30, 2013 at 6:17 PM
This is really a great post, thank you Mike.

I am trying to integrate CLD into my java code with no luck so far. The java wrapper here https://github.com/mzsanford/cld didn't work for 64bit machines. Any idea if there is another wrapper?

Best,
Ed

Anonymous, September 12, 2013 at 3:07 AM
Which one is best for short texts?

Pat, December 19, 2013 at 5:16 PM
Hi Mike,
Just wanted to say thanks! Definitely appreciate all the initiative you took on this project. We're finding the CLD Python binding really useful.

We did come across a strange problem where CLD fails to detect the correct language when an '&' character is part of the text. I wondered if anybody else had encountered this (or maybe I've missed something obvious).

>>> clean_text = "Nation & world: Russian president says Sochi will be 'fully tolerant' of gay athletes at Olympics"
>>> clean_text_utf = clean_text.encode('utf-8', 'ignore')
>>> cld.detect(clean_text_utf , pickSummaryLanguage=True, removeWeakMatches=True)
('Unknown', 'un', True, 9, [])
>>> cld.detect(clean_text_utf)
('Unknown', 'un', True, 9, [])

If we remove the '&' character, accuracy returns to normal:
>>> clean_text = "Nation world: Russian president says Sochi will be 'fully tolerant' of gay athletes at Olympics"
>>> clean_text_utf = clean_text.encode('utf-8', 'ignore')
>>> cld.detect(clean_text_utf)
('ENGLISH', 'en', True, 95, [('ENGLISH', 'en', 100, 76.985413290113456)])

There seem to be some other troublesome characters as well.

I see there is a CLD2. I don't yet have a sandbox install to see if this problem remains.

Best,

-Pat
