Google's Chrome browser has a useful translate feature: it detects the language of the page you're visiting and, if that differs from your local language, offers to translate it.
Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.
It turns out the CLD part of the Chromium source tree is nicely standalone, so I pulled it out into a new separate Google code project, making it possible to use CLD directly from any C++ code.
I also added basic initial Python binding (one method!), and ported the small C++ unit test (verifying detection of known strings for 64 different languages) to Python (it passes!).
So detecting language is now very simple from Python:
import cld
topLanguageName = cld.detect(bytes)[0]

The detect method returns a tuple, including the language name and code (such as RUSSIAN, ru), an isReliable boolean (True if CLD is quite sure of itself), the number of actual text bytes processed, and then details for each of the top languages (up to 3) that were identified. You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out beforehand.
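Here's a minimal sketch of unpacking the full result (the variable names are mine, and text_utf8 is just a placeholder for your own UTF-8 encoded bytes):

import cld

# text_utf8 must be clean, interchange-valid UTF-8 bytes
name, code, isReliable, textBytesFound, details = cld.detect(text_utf8)
# name, code      -- top language name and code, e.g. 'RUSSIAN', 'ru'
# isReliable      -- True if CLD is quite sure of itself
# textBytesFound  -- number of actual text bytes processed
# details         -- list of (name, code, percent, score) tuples for up to 3 top languages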
You can also optionally provide hints to the detect method, including the declared encoding and language (for example, from an HTTP header or an embedded META http-equiv tag in the HTML), as well as the domain name suffix (so the top level domain suffix es would boost the chances for detecting Spanish). CLD uses these hints to boost the priors for certain languages. There is this fun comment in the code, in front of the tables holding the per-language prior boosts:
Generated by dsites 2008.07.07 from 10% of Base

How I wish I too could build tables off of 10% of Base!
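Back to the hints: here's a rough sketch of what passing them looks like (the keyword argument names below are only illustrative; check the README or pydoc for the exact signature):

# page_bytes is a placeholder for the clean UTF-8 bytes of the page
name, code, isReliable, textBytesFound, details = cld.detect(
    page_bytes,
    hintTopLevelDomain='es',   # the page came from a .es domain
    hintLanguageCode='es')     # the declared language, e.g. from an HTTP header or META tag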
The code itself looks very cool and I suspect (but haven't formally verified!) it's quite accurate. I only understand bits and pieces about how it works; you can read some details here and here.
It's also not clear just how many languages it can detect; I see there are 161 "base" languages plus 44 "extended" languages, but then I see many test cases (102 out of 166!) commented out. This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.
This port is all still very new, and I extracted CLD quickly, so likely there are some problems still to work out, but the fact that it passes the Python unit test is encouraging. The README.txt has some more details.
Thank you Google!
Nice one! Great find!
Can be handy!
Thanks a lot! Will be very useful.
You can use Mozilla's libcharsetdetect to guess the encoding for the UTF8 conversion. I've packaged a standalone version of the code here: https://github.com/batterseapower/libcharsetdetect
Thank you very much. I've taken your Python bindings and provided some for PHP: https://github.com/lstrojny/php-ccld
link should be: https://github.com/lstrojny/php-cld
Very cool Mike! Did you do any tests on how it does with large vs small amounts of text?
Max, libcharsetdetect sounds great!
Lars, awesome that you built PHP bindings! Thanks.
Karl, I haven't done any real testing yet, but I am trying to compare it to the Java language detection package: http://code.google.com/p/language-detection
@Mike: regarding the missing encoding implementation in the Python binding: I provide a class Encoding with all the defined encoding integers as class constants. Check regenerate-encoding-table.sh and cld_encodings.h, maybe that's something for you too.
Lars, thank you! I've poached that back, exposed it as cld.ENCODINGS, and now support the encoding hint!
@Mike: cool, that allows me to remove a bunch of ugliness ;)
Thanks for putting this together. It did a nice job of identifying the languages used on Twitter.
Very cool. I'm trying to build the python bindings, but it fails because cld_encodings.h is missing. This include appears on Revision: 1c4ed384ca54 (I also set an issue on google code)
Woops, sorry about that aitzol -- I forgot to "hg add" the file. I just fixed it...
I was not aware that Chrome had a language identifier built into it, I had always assumed that it was done via queries to Google's Language Identification AJAX API. I have been developing my own approach to open-web language identification, it is available at https://github.com/saffsd/langid.py , and is based on my research that will be presented at IJCNLP2011 (http://www.ijcnlp2011.org/). I will compare my system to CLD when I can find some time to do so!
I just wanted to add that my paper on cross-domain feature selection for language identification has been published. It is available at http://www.ijcnlp2011.org/proceeding/IJCNLP2011-MAIN/pdf/IJCNLP-2011062.pdf
Has anyone done comparisons with http://odur.let.rug.nl/~vannoord/TextCat/ ? I would assume the major strength of CLD is in how huge their corpora are …
ReplyDeleteHi Marco,
I ran langid.py on 18 of the Europarl languages and it performs very well! 99.20% (17856 / 18000) vs the best (Java language-detection Google code) at 99.26% (17866 / 18000). Impressive! Especially considering how small the overall model is (and I love how it's just packed into a big Python string!).
da (95.4%) and sl (97.3%) are the two most challenging languages.
Also, this brings the "majority rules" (across all 4 detectors) accuracy up to 99.73% (17952/ 18000), which is awesome (means langid.py is pulling from relatively independent "signals" than the others).
Cool!
Anonymous, I haven't tried TextCat but it looks like a compelling option too!
ReplyDeleteHi Janis,
Not sure why your comment isn't shown here -- I'm copy/pasting it here:
Hi! I have problems with building and installing this module on Windows7 (needed to install gcc with mingw but when this was done, still was many errors while compiling) and Linux Ubuntu (full console with many errors like "./ceval.h:125: error: expected constructor, destructor, or type conversion before '(' token") - any suggestions what else I need to build your code?
I was able to build on Windows, using the checked in build.win.cmd (I'm using Visual Studio 8), but I don't have the older Visual Studio installed to compile the Python bindings.
On Linux (Fedora 13) I compiled fine with build.sh, using gcc 4.4.4, and then built the python bindings using setup.py.
Hi,
I'm currently trying to follow the steps to build the python bindings for the CLD under Windows 7. I (successfully) built the library using 'build.win.cmd', but when I try to run the python setup, I run into an error I don't know what to do with:
C:\Users\sascha\eclipse\workspaces\pydev\chromium-compact-language-detector>python -u setup.py build
running build
running build_ext
building 'cld' extension
C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -I. -IC:\Python26\include -IC:\Python26\PC /Tppycldmodule.cc /Fobuild\temp.win32-2.6\Release\pycldmodule.obj
pycldmodule.cc
c:\users\sascha\eclipse\workspaces\pydev\chromium-compact-language-detector\base/string_util.h(23) : error C2039: 'strcasecmp' : is not a member of '`global namespace''
error: command '"C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.exe"' failed with exit status 2
Any suggestions would be greatly appreciated!
Thanks, Sascha
Hi Sascha,
I just pushed a fix for that compilation error -- can you "hg pull -u" and see if it works?
This was due to a missing compile time define (/DWIN32) in setup.py.
thanks for this handy tool!
ReplyDeleteHi Mike,
thanks a lot for the fix, worked great now! Thank you for your help!
A note to other windows users: For some reason, something seemed to have gone wrong in renaming a file during the build. So after running
'build.win.cmd'
I had to run
'ren libcld.lib cld.lib'
before then running
'python -u setup.py build' and afterwards
'python -u setup.py install'.
Hello Mike,
Glad to hear that langid.py is working well for you. I'm continuing to develop it, trying to get it to work well with even shorter strings in even more languages.
@Anonymous RE:TextCat, I compared my tool langid.py extensively to TextCat in my recently published paper, a copy is available at http://www.ijcnlp2011.org/proceeding/IJCNLP2011-MAIN/pdf/IJCNLP-2011062.pdf . Our findings were quite straightforwards, the performance of TextCat really starts to fall off as more languages and more domains are considered.
I've been experimenting with Chrome CLD and langid.py, analysing 31,160 tweets containing "S4C" (the name of Wales' Welsh language TV channel - though not all references to S4C will have been to the channel). Most of the tweets are in English or Welsh: according to Chrome CLD there were 16,339 in English, 9,464 in Welsh. langid.py made it 18,219 English, 10,303 Welsh. Chrome CLD left 4,138 as "unknown". 8,981 were categorised as Welsh by both of them. Chrome CLD left 1,108 of langid.py's Welsh categorised tweets as unknown. You might be interested to see how the percentage agreement between Chrome CLD and langid.py varied by length of the tweet in this chart: http://dl.dropbox.com/u/15813120/Chrome_CLD_v_langid.py_Saes.png
(I know tweets are only meant to be 140 chars and my chart shows tweets longer than that. I reckon that must be because of encoding problems which I must have failed to cope with somewhere).
Nice chart! Thanks for sharing...
Thanks for sharing! The comparison is cool. Let me check that I understand 'agreement' correctly - it basically means that the two systems produced the same output for a given message?
DeleteSo if for a given message, langid.py output 'en' and cld output 'UNKNOWN', then this would be considered disagreement correct? My guess (not based on evidence!) is that cld will tend to output 'UNKNOWN' more often for shorter messages, and that this may account for some of the difference. I would be curious to see a comparison of messages where neither system labels the message 'unknown'. Also, both systems provide a measure of confidence, so you could also consider the correlation between confidence and the accuracy.
On the message length issue, I believe Twitter allows for 140 UTF8 codepoints. If I recall correctly, UTF8 can use up to 6 bytes per codepoint, allowing for theoretical upper bound of 840 bytes.
Sorry for taking so long to reply! Yes, "agreement" means the same language was detected.
I didn't get any 'unknowns' from langid.py. Excluding the 4,138 'unknowns' produced by Chrome CLD gave me this: https://dl.dropbox.com/u/15813120/no_unknowns_Chrome_CLD_v_langid.py_Saes.png i.e. much higher proportions in agreement, with the proportion dropping off when the tweet gets shorter than 70 characters. Such short tweets are less common though, as shown in this density plot: https://dl.dropbox.com/u/15813120/density_no_unknowns_Chrome_CLD_v_langid.py_Saes.png
Hi Mike,
Thank you for providing a Python wrapper for the CLD. I compiled the latest version with MinGW and it works. One thing however: You write in the README that you made no changes to the original Chromium source but at least encodings/compact_lang_det/compact_lang_det.h differs.
Thank you again,
/David
Aha! You are right David; I actually did modify compact_lang_det.{h,cc}, and compact_lang_det_impl.{h,cc}. These files provide the "entry points" to the core CLD library... and my changes were minor: I removed a few entry points that were likely backwards-compatible layers for within Chrome (not important here), and I also opened up control over previously hardwired functionality for removing weak matches and picking the summary language.
Also, cld_encodings.h is new (I copied this from the PHP port), and it just provides mappings from the encoding constants to their string names...
Thanks for your work. I can't fully reconcile the tuple returned with your description. For example, testing a string of 261 characters (with 3 languages) I get:
('ENGLISH', 'en', True, 264, [('ENGLISH', 'en', 78, 158.02269043760128), ('IRISH', 'ga', 22, 53.226879574184963)]).
I guess that 78 is the length detected as English, 22 detected as Irish, but what is the 158.022... and 53.226...? Probabilities expressed as parts of 1000?
Actually, 78 and 22 are the "percent likelihood" for the match, and then the number after that is called "normalized_score" in the code. I'm really not sure exactly how to interpret these numbers, except to say that higher numbers mean stronger matches...
The net number of bytes matched is returned at the top (ie, not per language that matched); in your case it's 264.
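In code, something like this (a sketch; the variable names are mine, matching the tuple you pasted):

import cld

name, code, isReliable, numTextBytes, details = cld.detect(text_utf8)
for langName, langCode, percent, normalizedScore in details:
    # percent is the "percent likelihood"; normalizedScore is higher for stronger matches
    print '%s (%s): %d%%, score %.2f' % (langName, langCode, percent, normalizedScore)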
Hi Mike
I have several questions:
- Do you plan to make "official" release for your project?
- How often do you synchronize source with Chromium?
thank you
Hi Alex,
Unfortunately I don't sync up w/ CLD at all (or at least, I haven't yet). And I wasn't planning on doing official releases...
Patches are welcome!!
Thanks,
Mike
Mike,
Nice work!
CLD is great for detecting the language for a given buffer. However, I need to extract the word boundaries from the buffer as well as detect the language for each of these words.
I'm developing a search engine and I need to pass each word to the appropriate language specific stemmer.
Tokenizing mixed language content is definitely a challenge ... you could try Lucene's StandardAnalyzer? It tokenizes according to the Unicode UAX #29 text segmentation standard.
Maybe also email the Lucene users list (java-user@lucene.apache.org)?
Hi Mike,
This post is awesome for me.
I have several questions:
CLD is a very useful framework.
But I've googled all about CLD and can't find a hint about using it for iOS (Objective-C).
Could you please give me a hint on how to build the CLD library for the iPhone?
It seems it could be converted for use with the gcc compiler for armv7 (iPhoneDevice) and i386 (iPhoneSimulator), maybe by modifying the build.sh script file, but I do not know what to do.
Aside from that, Java, Python, Ruby, C#, node.js etc. seem to be ported, but sadly not Objective-C.
How do I port or build the CLD library for iOS development?
Finally, iOS only allows static libraries, not dynamic libraries.
My PC OS is macOSX10.6.7 Lion, xcode 4.3.2. Thanks.
Hi Minseok,
Unfortunately I don't know much about iOS development. It could be that langid.py https://github.com/saffsd/langid.py is a better fit?
Hi Mike,
It's great to find your tool online!
I was trying to run it to test, however, in the new package I am missing the 'bindings' dir.. I see it in the sources, though.. How do I get the bindings dir without actually checking out the source code?
Thanks a lot!
Liolik,
Hmmm: can you open an issue at http://code.google.com/p/chromium-compact-language-detector/issues/list ? Thanks.
We recently reworked the packaging so something could easily be wrong ... are you using the C or Python APIs? And, which release package?
Hi Mike,
Very nice work.
I've just downloaded and compiled/installed from the .tar.gz source file and the 'bindings' dir is actually missing (as mentioned by Liolik).
http://code.google.com/p/chromium-compact-language-detector/downloads/detail?name=compact-language-detector-0.1.tar.gz&can=2&q=
Ray
Hi Liolik, Anonymous: I put a comment on the issue ...
thanks Mike, I'll give it a try.
Ray
Hi Mike,
I am new to Python. How do I install the cld module? I don't know how to install new modules to Python. I tried looking at many blogs; most say to use the setup.py file with the command: python setup.py install. But I don't find such a thing in the cld files. Please help me, I need language detection even if it's not done the pythonic way, so suggestions other than Python are also more than welcome. Thanks a lot
Manoj
Hi manoj1919,
First you need to build & install CLD from sources (see the downloads on the google code site).
Then, download the Python wrapper from pypi: http://pypi.python.org/pypi/chromium_compact_language_detector/0.1.1
and run setup.py from there.
Mike,
Have installed module and bindings apparently correctly. I have a
chromium_compact_language_detector-0.1.1-py2.7.egg-info and cld.so
in /usr/lib/python2.7/site-packages which I take as rather positive signs!
However, when in the python interpreter upon trying to import cld I get
>>> import cld
Traceback (most recent call last):
File "", line 1, in
ImportError: libcld.so.0: cannot open shared object file: No such file or directory
Any idea what I am doing wrong?
Many thanks for making this available. I hope I can use it.
Regards
Hi Jérôme,
Hmm that's odd. Are you sure you're running Python2.7 when you run "import cld"?
If you import sys and print sys.path does it have /usr/lib/python2.7/site-packages in the list?
Running 2.7 for sure and sys.path does have /usr/lib/python2.7/site-packages. See below.
[jrichalot@myhost site-packages]$ python2
Python 2.7.2 (default, Jan 31 2012, 13:26:35)
[GCC 4.6.2 20120120 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cld
Traceback (most recent call last):
File "", line 1, in
ImportError: libcld.so.0: cannot open shared object file: No such file or directory
>>> import sys
>>> print sys.path
['', '/usr/lib/python2.7/site-packages/guardian_openplatform-0.0.2-py2.7.egg', '/usr/lib/python2.7/site-packages/simplejson-2.3.3-py2.7-linux-i686.egg', '/usr/lib/python2.7/site-packages/html5lib-0.95-py2.7.egg', '/usr/lib/python2.7/site-packages/MySQL_python-1.2.3-py2.7-linux-i686.egg', '/usr/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/usr/lib/python2.7/site-packages/pip-1.1-py2.7.egg', '/usr/lib/python2.7/site-packages/python_twitter-0.8.2-py2.7.egg', '/usr/lib/python2.7/site-packages/oauth2-1.5.211-py2.7.egg', '/usr/lib/python27.zip', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/lib/python2.7/site-packages', '/usr/lib/python2.7/site-packages/PIL', '/usr/lib/python2.7/site-packages/gst-0.10', '/usr/lib/python2.7/site-packages/gtk-2.0']
Hmm does your LD_LIBRARY_PATH point to the directory (maybe /usr/local/lib?) where libcld.so is installed?
1. Hmm indeed, I do not seem to have an LD_LIBRARY_PATH
ReplyDelete[root@myhost ~]# echo $LD_LIBRARY_PATH
[root@myhost ~]#
2. libcld.so is indeed in /usr/local/lib
Must investigate... TBC
And we have success
[jrichalot@myhost ~]$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"/usr/local/lib"
[jrichalot@myhost ~]$ echo $LD_LIBRARY_PATH
:/usr/local/lib
[jrichalot@myhost ~]$ python2
Python 2.7.2 (default, Jan 31 2012, 13:26:35)
[GCC 4.6.2 20120120 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cld
>>>
not sure the solution is advisable and/or persistent tbh but that will allow me to to keep working.
Many thanks for your prompt and kind assistance in this matter.
I had trouble getting the official project installed on Windows, so was pointed to this location for an easy_install: http://www.lfd.uci.edu/~gohlke/pythonlibs/#cld
However, I am getting encoding issues with the python bindings.
clean_text = 'a tweet from twitter'
clean_text_utf = clean_text.encode('utf-8', 'ignore')
cld.detect(clean_text_utf , pickSummaryLanguage=True, removeWeakMatches=True)
After awhile, I get 'UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4'
Any ideas on how to resolve?
Hi Simon,
Can you post a small test case showing that exception? Is it coming from within CLD?
Working on the issue on SE here:
http://stackoverflow.com/questions/13473861/encoding-issues-with-cld
Likely not related to CLD, but my rookie python skills in getting the required string across to CLD?
One thing you might be able to help me with (BTW, your name will appear in the credits on final map I am making) is if my CLD parameters are accurate. At the moment, I am seeing some strings being misinterpreted.
some_text = "ha I dont know When can I know your news"
cld.detect(some_text, pickSummaryLanguage=True, removeWeakMatches=True)
This returns:
('FRENCH', 'fr', True, 43, [('ENGLISH', 'en', 62, 26.74230145867099), ('FRENCH', 'fr', 38, 8.221993833504625)])
I am grabbing the first value as the predicted language, but it is often wrong.
What is the significance of the 26 and 8 values. Are these confidence scores?
My cut down code: http://pastebin.com/3DU2BYp0
Hi Simon,
I put an answer on the stackoverflow question.
That first value is in fact the predicted language, and it's clearly wrong in your example! Urgh. I confirmed I get the same results on Linux ...
Maybe try passing pickSummaryLanguage=False? There is some "smarts" in CLD that sometimes picks a weaker matching language as the choice, as happened in this example. When I pass pickSummaryLanguage=False, it gets the correct answer for your example ... and I think when I ran my benchmark I passed False.
The 26/8 are what CLD calls the "normalized score"; I'm not sure how it's computed ...
Mike - really appreciate the quick response.
Glad to know that you're getting the same results at your end. I have changed to False.
From looking at my result above, I was confused that the first result is French, but the 'normalised score' for English was higher than the score for French?
I also find that confusing ... somehow whatever "smarts" is implemented in the pickSummaryLanguage=True is able to take a worse-scoring language and pick it ... I'm not sure why :) And I think in my original tests I saw worse accuracy if I passed True.
Thanks Mike. I'm getting better results than before with the psl=t, but I'm thinking that I should perhaps add logic to manually loop through the list of results, and pick the one with the highest score, as opposed to relying on the first result returned. Where do you think would be the best place to query this further?
I believe with pSL=False that you'll always get the top scoring language as the choice. Have you ever found a case where you didn't?
You're right, sticking with False. One thing I am pondering is why I'm not getting a lot of greek results coming through. If I go to Google Translate, I can type a basic english sentence, grab the greek, and paste it into CLD, and it returns the wrong language pretty much every time. Any ideas here?
>>> cld.detect("Σήμερα ο καιρός είναι ζεστός", pickSummaryLanguage=False, removeWeakMatches=True)
('RUSSIAN', 'ru', True, 22, [('RUSSIAN', 'ru', 31, 1.303780964797914)])
Simon, you need to first encode that greek string as UTF8, eg:
>>> import codecs
>>> cld.detect(codecs.getencoder('UTF-8')(u'Σήμερα ο καιρός είναι ζεστός')[0])
('GREEK', 'el', True, 54, [('GREEK', 'el', 100, 54.0)])
hi ,
please someone can give me the steps to install cld under windows 7.
thanks
Hi Anonymous,
At one point the build.win.cmd worked on windows, for just the library (not the Python wrapper), but I haven't tested this in some time ...
there is http://www.lfd.uci.edu/~gohlke/pythonlibs/#cld an executable ready for installation . I want to know which of these two library (langid and CLD) is better for classifying tweets
I want to know which of these two library (langid and CLD) is better for classifying tweets
ReplyDeleteHi Anonymous,
I don't know off hand which library is better ... you need to test for yourself I think.
Mike,
Just wanted to say thanks for the awesome module. I was using guess-language from pip before but CLD gave me a huge boost in result accuracy!
I can also confirm that the 0.2 version didn't work on heroku but the 0.031415 does!! So thanks again for that :)
Cheers,
Hi Sylvain,
I'm glad you found CLD useful. Say thank you to Google :)
That's spooky that 0.2 does NOT work but 0.031415 (the older version) does. What went wrong with 0.2?
I did a comparison of CLD with our own language detection API web service (WhatLanguage.net). You can read the full comparison of WhatLanguage.net, CLD, Tika, language-detection and langid.py at http://www.whatlanguage.net/en/api/accuracy_language_detection
This is really a great post, thank you Mike.
ReplyDeleteI am trying to integrate CLD into my java code with no luck so far. The java wrapper here https://github.com/mzsanford/cld didn't work for 64bit machines. Any idea if there is another wrapper?
Best,
Ed
Hi Anonymous,
I don't know anything about that port ... and I don't know of any other Java ports.
However, there is a new java port of langid.py (https://github.com/saffsd/langid.py) at https://github.com/carrotsearch/langid-java ... I did some simple tests and it gets the same results as langid.py and is quite a bit faster.
There is also the language detection library https://code.google.com/p/language-detection/
which one is the best for the short texts?
Hi Anonymous,
I'm really not sure ... you should go test them and then report back! But in general short text is quite a bit harder...
Hi Mike,
Just wanted to say thanks! Definitely appreciate all the initiative you took on this project. We're finding the CLD Python binding really useful.
We did come across a strange problem where CLD fails to detect the correct language when an '&' character is part of the text. I wondered if anybody else had encountered this (or maybe I've missed something obvious).
>>> clean_text = "Nation & world: Russian president says Sochi will be 'fully tolerant' of gay athletes at Olympics"
>>> clean_text_utf = clean_text.encode('utf-8', 'ignore')
>>> cld.detect(clean_text_utf , pickSummaryLanguage=True, removeWeakMatches=True)
('Unknown', 'un', True, 9, [])
>>> cld.detect(clean_text_utf)
('Unknown', 'un', True, 9, [])
if we remove the '&' character, accuracy returns to normal
>>> clean_text = "Nation world: Russian president says Sochi will be 'fully tolerant' of gay athletes at Olympics"
>>> clean_text_utf = clean_text.encode('utf-8', 'ignore')
>>> cld.detect(clean_text_utf)
('ENGLISH', 'en', True, 95, [('ENGLISH', 'en', 100, 76.985413290113456)])
There seem to be some other troublesome characters as well.
I see there is a CLD2. I don't yet have a sandbox install to see if this problem remains.
Best,
-Pat
My guess is it's trying to parse an escaped HTML character? Are you passing isPlainText=False (this is the default). If so, can you open an issue with the CLD2 project? A lone & should be untouched ... but maybe CLD2 is doing something silly like throwing out the rest of the input after the &.
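One quick thing to try in the meantime (just a guess, not a confirmed fix): tell CLD the input is plain text rather than HTML, so it shouldn't try to interpret the &:

>>> cld.detect(clean_text_utf, isPlainText=True)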