Friday, October 21, 2011

Language detection with Google's Compact Language Detector


Google's Chrome browser has a useful translate feature: it detects the language of the page you've visited and, if it differs from your local language, offers to translate it.

Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.

It turns out the CLD part of the Chromium source tree is nicely standalone, so I pulled it out into a new separate Google code project, making it possible to use CLD directly from any C++ code.

I also added a basic initial Python binding (one method!), and ported the small C++ unit test (verifying detection of known strings for 64 different languages) to Python (it passes!).

So detecting language is now very simple from Python:
    import cld
    topLanguageName = cld.detect(bytes)[0]
The detect method returns a tuple, including the language name and code (such as RUSSIAN, ru), an isReliable boolean (True if CLD is quite sure of itself), the number of actual text bytes processed, and then details for each of the top languages (up to 3) that were identified.
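
For example, here's a quick sketch of unpacking the full result (the exact numbers will vary by input and CLD version):
    import cld

    # detect returns (name, code, isReliable, textBytesFound, details):
    name, code, isReliable, textBytesFound, details = cld.detect(
        u'Ein kleiner Text auf Deutsch'.encode('utf-8'))
    print name, code, isReliable, textBytesFound  # e.g. GERMAN de True ...
    for langName, langCode, percent, score in details:
        print langName, langCode, percent, score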

You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out beforehand.
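
For example, starting from a Python unicode object (a minimal sketch):
    # detect expects UTF-8 bytes, not a unicode object, so encode first:
    utf8Bytes = u'Это русский текст'.encode('utf-8')
    topLanguageName = cld.detect(utf8Bytes)[0]  # e.g. RUSSIAN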

You can also optionally provide hints to the detect method, including the declared encoding and language (for example, from an HTTP header or an embedded META http-equiv tag in the HTML), as well as the domain name suffix (so the top-level domain suffix es would boost the chances for detecting Spanish). CLD uses these hints to boost the priors for certain languages. There is this fun comment in the code in front of the tables holding the per-language prior boosts:
    Generated by dsites 2008.07.07 from 10% of Base
How I wish I too could build tables off of 10% of Base!
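
Here is roughly how passing hints looks from Python. Note that the keyword argument names below are illustrative, not necessarily the binding's exact spelling (check help(cld.detect)):
    import cld

    utf8Bytes = u'un texto muy corto'.encode('utf-8')
    # NOTE: hypothetical keyword names -- consult the binding for the real signature:
    name, code, isReliable, textBytesFound, details = cld.detect(
        utf8Bytes,
        hintTopLevelDomain='es',  # TLD suffix boosts Spanish
        hintLanguageCode='es')    # declared language, e.g. from a META tag
    print name  # hopefully SPANISH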

The code itself looks very cool and I suspect (but haven't formally verified!) it's quite accurate. I only understand bits and pieces about how it works; you can read some details here and here.

It's also not clear just how many languages it can detect; I see there are 161 "base" languages plus 44 "extended" languages, but then I see many test cases (102 out of 166!) commented out.  This was likely done to reduce the size of the n-gram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.

This port is all still very new, and I extracted CLD quickly, so likely there are some problems still to work out, but the fact that it passes the Python unit test is encouraging.  The README.txt has some more details.

Thank you Google!

78 comments:

  1. Thanks a lot! Will be very useful.

  2. You can use Mozilla's libcharsetdetect to guess the encoding for the UTF8 conversion. I've packaged a standalone version of the code here: https://github.com/batterseapower/libcharsetdetect

  3. Thank you very much. I've taken your Python bindings and provided some for PHP: https://github.com/lstrojny/php-ccld

    Replies
    1. link should be: https://github.com/lstrojny/php-cld

  4. Very cool Mike! Did you do any tests on how it does with large vs small amounts of text?

  5. Lars, awesome that you built PHP bindings! Thanks.

  6. Karl, I haven't done any real testing yet, but I am trying to compare it to the Java language detection package: http://code.google.com/p/language-detection

  7. @Mike: regarding the missing encoding implementation in the Python binding: I provide a class Encoding with all the defined encoding integers as class constants. Check regenerate-encoding-table.sh and cld_encodings.h, maybe that’s something for you too.

  8. Lars, thank you! I've poached that back, exposed it as cld.ENCODINGS, and now support the encoding hint!
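
    A quick sketch of the idea (the exact shape of cld.ENCODINGS and the hint keyword may differ a bit -- check the module source):

    >>> import cld
    >>> utf8Bytes = u'quelques mots en français'.encode('utf-8')
    >>> # Illustrative: hint that the source document declared ISO-8859-1
    >>> cld.detect(utf8Bytes, hintEncoding=cld.ENCODINGS['ISO_8859_1'])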

  9. @Mike: cool, that allows me to remove a bunch of ugliness ;)

  10. Thanks for putting this together. It did a nice job of identifying the languages used on Twitter.

  11. Very cool. I'm trying to build the Python bindings, but it fails because cld_encodings.h is missing. This include appears as of revision 1c4ed384ca54 (I also filed an issue on Google Code)

  12. Woops, sorry about that aitzol -- I forgot to "hg add" the file. I just fixed it...

  13. I was not aware that Chrome had a language identifier built into it, I had always assumed that it was done via queries to Google's Language Identification AJAX API. I have been developing my own approach to open-web language identification, it is available at https://github.com/saffsd/langid.py , and is based on my research that will be presented at IJCNLP2011 (http://www.ijcnlp2011.org/). I will compare my system to CLD when I can find some time to do so!

    Replies
    1. I just wanted to add that my paper on cross-domain feature selection for language identification has been published. It is available at http://www.ijcnlp2011.org/proceeding/IJCNLP2011-MAIN/pdf/IJCNLP-2011062.pdf

  14. Has anyone done comparisons with http://odur.let.rug.nl/~vannoord/TextCat/ ? I would assume the major strength of CLD is in how huge their corpora are …

  15. Hi Marco,

    I ran langid.py on 18 of the Europarl languages and it performs very well! 99.20% (17856 / 18000) vs the best (Java language-detection Google code) at 99.26% (17866 / 18000). Impressive! Especially considering how small the overall model is (and I love how it's just packed into a big Python string!).

    da (95.4%) and sl (97.3%) are the two most challenging languages.

    Also, this brings the "majority rules" (across all 4 detectors) accuracy up to 99.73% (17952 / 18000), which is awesome (it means langid.py is drawing on "signals" relatively independent of the others).
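
    ("Majority rules" here is just simple voting across the detectors -- a quick sketch, with stand-in detector callables:)

    from collections import Counter

    def majority_language(text, detectors):
        # detectors: stand-in callables, each mapping text -> a language code
        votes = Counter(detect(text) for detect in detectors)
        return votes.most_common(1)[0][0]  # the most common vote wins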

    Cool!

  16. Anonymous, I haven't tried TextCat but it looks like a compelling option too!

  17. Hi Janis,

    Not sure why your comment isn't shown here -- I'm copy/pasting it here:

    Hi! I have problems building and installing this module on Windows 7 (I needed to install gcc with MinGW, but even then there were many errors while compiling) and on Ubuntu Linux (a console full of errors like "./ceval.h:125: error: expected constructor, destructor, or type conversion before '(' token") -- any suggestions on what else I need to build your code?

    I was able to build on Windows, using the checked in build.win.cmd (I'm using Visual Studio 8), but I don't have the older Visual Studio installed to compile the Python bindings.

    On Linux (Fedora 13) I compiled fine with build.sh, using gcc 4.4.4, and then built the Python bindings using setup.py.

  18. Hi,
    I'm currently trying to follow the steps to build the Python bindings for CLD under Windows 7. I (successfully) built the library using 'build.win.cmd', but when I try to run the Python setup, I run into an error I don't know what to do with:

    C:\Users\sascha\eclipse\workspaces\pydev\chromium-compact-language-detector>python -u setup.py build
    running build
    running build_ext
    building 'cld' extension
    C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -I. -IC:\Python26\include -IC:\Python26\PC /Tppycldmodule.cc /Fobuild\temp.win32-2.6\Release\pycldmodule.obj
    pycldmodule.cc
    c:\users\sascha\eclipse\workspaces\pydev\chromium-compact-language-detector\base/string_util.h(23) : error C2039: 'strcasecmp' : is not a member of '`global namespace''
    error: command '"C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\BIN\cl.exe"' failed with exit status 2


    Any suggestions would be greatly appreciated!
    Thanks, Sascha

  19. Hi Sascha,

    I just pushed a fix for that compilation error -- can you "hg pull -u" and see if it works?

    This was due to a missing compile time define (/DWIN32) in setup.py.

  20. thanks for this handy tool!

  21. Hi Mike,

    thanks a lot for the fix -- it works great now! Thank you for your help!

    A note to other Windows users: for some reason, something seems to have gone wrong renaming a file during the build. So after running

    'build.win.cmd'

    I had to run

    'ren libcld.lib cld.lib'

    before then running

    'python -u setup.py build' and afterwards
    'python -u setup.py install'.

  22. Hello Mike,

    Glad to hear that langid.py is working well for you. I'm continuing to develop it, trying to get it to work well with even shorter strings in even more languages.

    @Anonymous RE: TextCat, I compared my tool langid.py extensively to TextCat in my recently published paper; a copy is available at http://www.ijcnlp2011.org/proceeding/IJCNLP2011-MAIN/pdf/IJCNLP-2011062.pdf . Our findings were quite straightforward: the performance of TextCat really starts to fall off as more languages and more domains are considered.

    Replies
    1. I've been experimenting with Chrome CLD and langid.py, analysing 31,160 tweets containing "S4C" (the name of Wales' Welsh language TV channel - though not all references to S4C will have been to the channel). Most of the tweets are in English or Welsh: according to Chrome CLD there were 16,339 in English, 9,464 in Welsh. langid.py made it 18,219 English, 10,303 Welsh. Chrome CLD left 4,138 as "unknown". 8,981 were categorised as Welsh by both of them. Chrome CLD left 1,108 of langid.py's Welsh categorised tweets as unknown. You might be interested to see how the percentage agreement between Chrome CLD and langid.py varied by length of the tweet in this chart: http://dl.dropbox.com/u/15813120/Chrome_CLD_v_langid.py_Saes.png

      (I know tweets are only meant to be 140 chars and my chart shows tweets longer than that. I reckon that must be because of encoding problems which I must have failed to cope with somewhere).

    2. Thanks for sharing! The comparison is cool. Let me check that I understand 'agreement' correctly -- it basically means that the two systems produced the same output for a given message?

      So if, for a given message, langid.py output 'en' and cld output 'UNKNOWN', then this would be considered disagreement, correct? My guess (not based on evidence!) is that cld will tend to output 'UNKNOWN' more often for shorter messages, and that this may account for some of the difference. I would be curious to see a comparison of messages where neither system labels the message 'unknown'. Also, both systems provide a measure of confidence, so you could also consider the correlation between confidence and accuracy.

      On the message length issue, I believe Twitter allows for 140 UTF-8 codepoints. If I recall correctly, UTF-8 can use up to 6 bytes per codepoint, allowing for a theoretical upper bound of 840 bytes.

    3. Sorry for taking so long to reply! Yes, "agreement" means the same language was detected.

      I didn't get any 'unknowns' from langid.py. Excluding the 4,138 'unknowns' produced by Chrome CLD gave me this: https://dl.dropbox.com/u/15813120/no_unknowns_Chrome_CLD_v_langid.py_Saes.png i.e. much higher proportions in agreement, with the proportion dropping off when the tweet gets shorter than 70 characters. Such short tweets are less common though, as shown in this density plot: https://dl.dropbox.com/u/15813120/density_no_unknowns_Chrome_CLD_v_langid.py_Saes.png

  23. Hi Mike,

    Thank you for providing a Python wrapper for the CLD. I compiled the latest version with MinGW and it works. One thing, however: you write in the README that you made no changes to the original Chromium source, but at least encodings/compact_lang_det/compact_lang_det.h differs.

    Thank you again,

    /David

  24. Aha! You are right David; I actually did modify compact_lang_det.{h,cc} and compact_lang_det_impl.{h,cc}. These files provide the "entry points" to the core CLD library... and my changes were minor: I removed a few entry points that were likely backwards-compatibility layers within Chrome (not important here), and I also opened up control over previously hardwired functionality for removing weak matches and picking the summary language.

    Also, cld_encodings.h is new (I copied this from the PHP port), and it just provides mappings from the encoding constants to their string names...

  25. Thanks for your work. I can't fully reconcile the tuple returned with your description. For example, testing a string of 261 characters (with 3 languages) I get:
    ('ENGLISH', 'en', True, 264, [('ENGLISH', 'en', 78, 158.02269043760128), ('IRISH', 'ga', 22, 53.226879574184963)])

    I guess that 78 is the length detected as English and 22 the length detected as Irish, but what are the 158.022... and 53.226...? Probabilities expressed as parts of 1000?

  26. Actually, 78 and 22 are the "percent likelihood" for each match, and the number after each is called "normalized_score" in the code. I'm really not sure exactly how to interpret these numbers, except to say that higher numbers mean stronger matches...

    The net number of bytes matched is returned at the top level (i.e., not per matched language); in your case it's 264.
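
    In code, picking the result apart looks roughly like this (a sketch):

    name, code, isReliable, bytesFound, details = cld.detect(
        u'some mixed English agus Gaeilge text'.encode('utf-8'))
    # bytesFound: net bytes matched overall (your 264)
    for langName, langCode, percent, normalizedScore in details:
        # percent: "percent likelihood"; normalizedScore: higher = stronger match
        print langName, percent, normalizedScore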

  27. Hi Mike

    I have several questions:
    - Do you plan to make an "official" release of your project?
    - How often do you synchronize the source with Chromium?

    thank you

  28. Hi Alex,

    Unfortunately I don't sync up w/ CLD at all (or at least, I haven't yet). And I wasn't planning on doing official releases...

    Patches are welcome!!

    Thanks,

    Mike

  29. Mike,

    Nice work!

    CLD is great for detecting the language of a given buffer. However, I need to extract the word boundaries from the buffer as well as detect the language for each of these words.

    I'm developing a search engine and I need to pass each word to the appropriate language-specific stemmer.

  30. Tokenizing mixed-language content is definitely a challenge ... you could try Lucene's StandardAnalyzer? It tokenizes according to the Unicode UAX #29 text segmentation standard.
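
    If you want to stay in Python, PyICU exposes the same UAX #29 word segmentation (PyICU is just my suggestion here; it's separate from CLD). A sketch:

    from icu import BreakIterator, Locale

    def uax29_segments(text):
        # text: a unicode string; yields its non-whitespace UAX #29 segments
        bi = BreakIterator.createWordInstance(Locale('en_US'))
        bi.setText(text)
        start = bi.first()
        for end in bi:  # iterating yields successive boundary offsets
            segment = text[start:end]
            start = end
            if segment.strip():
                yield segment

    You could then run cld.detect over a window of segments around each word to pick the right stemmer.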

    Maybe also email the Lucene users list (java-user@lucene.apache.org)?

  31. Hi Mike,

    This post is awesome for me.

    I have several questions:

    CLD looks like a very useful framework.

    But I have googled all about CLD and can't find any hints about using it on iOS (Objective-C).

    Could you please give me a hint on building the CLD library for the iPhone?

    It seems it could be built with the gcc compiler for armv7 (iPhone device) and i386 (iPhone simulator), maybe by modifying the build.sh script, but I do not know what to do.

    Besides that: Java, Python, Ruby, C#, node.js, etc. seem to have been ported, but sadly not Objective-C.

    How do I port or build the CLD library for iOS development?

    Finally, iOS only allows static libraries, not dynamic libraries.

    My OS is Mac OS X 10.6.7 Lion, with Xcode 4.3.2. Thanks.

  32. Hi Minseok,

    Unfortunately I don't know much about iOS development. Perhaps langid.py (https://github.com/saffsd/langid.py) is a better fit?

  33. Hi Mike,

    It's great to find your tool online!
    I was trying to run a test, but the 'bindings' dir is missing from the new package ... I see it in the sources, though. How do I get the bindings dir without actually checking out the source code?

    Thanks a lot!

  34. Liolik,

    Hmmm: can you open an issue at http://code.google.com/p/chromium-compact-language-detector/issues/list ? Thanks.

    We recently reworked the packaging so something could easily be wrong ... are you using the C or Python APIs? And, which release package?

  35. Hi Mike,
    Very nice work.

    I've just downloaded and compiled/installed from the .tar.gz source file and the 'bindings' dir is actually missing (as mentioned by Liolik).
    http://code.google.com/p/chromium-compact-language-detector/downloads/detail?name=compact-language-detector-0.1.tar.gz&can=2&q=

    Ray

  36. Hi Liolik, Anonymous: I put a comment on the issue ...

  37. thanks Mike, I'll give it a try.
    Ray

  38. Hi Mike,
    I am new to Python; how do I install the cld module? I don't know how to install new modules to Python. I tried looking at many blogs; most say to use the setup.py file via the command "python setup.py install", but I don't find such a thing in the cld files. Please help me; I need language detection even if it's not done the Pythonic way, so suggestions other than Python are also more than welcome. Thanks a lot
    Manoj

  39. Hi manoj1919,

    First you need to build & install CLD from sources (see the downloads on the Google Code site).

    Then, download the Python wrapper from pypi: http://pypi.python.org/pypi/chromium_compact_language_detector/0.1.1
    and run setup.py from there.

  40. Mike,

    I have installed the module and bindings, apparently correctly. I have a
    chromium_compact_language_detector-0.1.1-py2.7.egg-info and cld.so
    in /usr/lib/python2.7/site-packages which I take as rather positive signs!

    However, in the Python interpreter, upon trying to import cld I get

    >>> import cld
    Traceback (most recent call last):
    File "", line 1, in
    ImportError: libcld.so.0: cannot open shared object file: No such file or directory

    Any idea what I am doing wrong?

    Many thanks for making this available. I hope I can use it.

    Regards

  41. Hi Jérôme,

    Hmm that's odd. Are you sure you're running Python 2.7 when you run "import cld"?

    If you import sys and print sys.path, does it have /usr/lib/python2.7/site-packages in the list?

    Replies
    1. Running 2.7 for sure and sys.path does have /usr/lib/python2.7/site-packages. See below.


      [jrichalot@myhost site-packages]$ python2
      Python 2.7.2 (default, Jan 31 2012, 13:26:35)
      [GCC 4.6.2 20120120 (prerelease)] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import cld
      Traceback (most recent call last):
      File "", line 1, in
      ImportError: libcld.so.0: cannot open shared object file: No such file or directory
      >>> import sys
      >>> print sys.path
      ['', '/usr/lib/python2.7/site-packages/guardian_openplatform-0.0.2-py2.7.egg', '/usr/lib/python2.7/site-packages/simplejson-2.3.3-py2.7-linux-i686.egg', '/usr/lib/python2.7/site-packages/html5lib-0.95-py2.7.egg', '/usr/lib/python2.7/site-packages/MySQL_python-1.2.3-py2.7-linux-i686.egg', '/usr/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg', '/usr/lib/python2.7/site-packages/pip-1.1-py2.7.egg', '/usr/lib/python2.7/site-packages/python_twitter-0.8.2-py2.7.egg', '/usr/lib/python2.7/site-packages/oauth2-1.5.211-py2.7.egg', '/usr/lib/python27.zip', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/lib/python2.7/site-packages', '/usr/lib/python2.7/site-packages/PIL', '/usr/lib/python2.7/site-packages/gst-0.10', '/usr/lib/python2.7/site-packages/gtk-2.0']

  42. Hmm does your LD_LIBRARY_PATH point to the directory (maybe /usr/local/lib?) where libcld.so is installed?

  43. 1. Hmm indeed, I do not seem to have an LD_LIBRARY_PATH

    [root@myhost ~]# echo $LD_LIBRARY_PATH

    [root@myhost ~]#

    2. libcld.so is indeed in /usr/local/lib

    Must investigate... TBC

    Replies
    1. And we have success


      [jrichalot@myhost ~]$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"/usr/local/lib"
      [jrichalot@myhost ~]$ echo $LD_LIBRARY_PATH
      :/usr/local/lib
      [jrichalot@myhost ~]$ python2
      Python 2.7.2 (default, Jan 31 2012, 13:26:35)
      [GCC 4.6.2 20120120 (prerelease)] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import cld
      >>>

      Not sure the solution is advisable and/or persistent, tbh, but it will allow me to keep working.

      Many thanks for your prompt and kind assistance in this matter.

  44. I had trouble getting the official project installed on Windows, so was pointed to this location for an easy_install: http://www.lfd.uci.edu/~gohlke/pythonlibs/#cld

    However, I am getting encoding issues with the Python bindings.

    clean_text = 'a tweet from twitter'
    clean_text_utf = clean_text.encode('utf-8', 'ignore')
    cld.detect(clean_text_utf , pickSummaryLanguage=True, removeWeakMatches=True)

    After a while, I get 'UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4'

    Any ideas on how to resolve?

  45. Hi Simon,

    Can you post a small test case showing that exception? Is it coming from within CLD?

  46. Working on the issue on SE here:
    http://stackoverflow.com/questions/13473861/encoding-issues-with-cld
    Likely not related to CLD, but rather my rookie Python skills in getting the required string across to CLD?

    One thing you might be able to help me with (BTW, your name will appear in the credits on the final map I am making) is whether my CLD parameters are right. At the moment, I am seeing some strings being misinterpreted.

    some_text = "ha I dont know When can I know your news"
    cld.detect(some_text, pickSummaryLanguage=True, removeWeakMatches=True)
    This returns:
    ('FRENCH', 'fr', True, 43, [('ENGLISH', 'en', 62, 26.74230145867099), ('FRENCH', 'fr', 38, 8.221993833504625)])

    I am grabbing the first value as the predicted language, but it is often wrong.
    What is the significance of the 26 and 8 values? Are these confidence scores?

    My cut-down code: http://pastebin.com/3DU2BYp0

  47. Hi Simon,

    I put an answer on the stackoverflow question.

    That first value is in fact the predicted language, and it's clearly wrong in your example! Urgh. I confirmed I get the same results on Linux ...

    Maybe try passing pickSummaryLanguage=False? There is some "smarts" in CLD that sometimes picks a weaker matching language as the choice, as happened in this example. When I pass pickSummaryLanguage=False, it gets the correct answer for your example ... and I think when I ran my benchmark I passed False.
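
    That is, for your snippet above:

    >>> cld.detect(some_text, pickSummaryLanguage=False, removeWeakMatches=True)  # now picks ENGLISH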

    The 26/8 are what CLD calls the "normalized score"; I'm not sure how it's computed ...

  48. Mike - really appreciate the quick response.
    Glad to know that you're getting the same results at your end. I have changed to False.

    From looking at my result above, I was confused: the first result is French, but the 'normalised score' for English was higher than the score for French?

  49. I also find that confusing ... somehow whatever "smarts" are implemented when pickSummaryLanguage=True can take a worse-scoring language and pick it ... I'm not sure why :) And I think in my original tests I saw worse accuracy if I passed True.

  50. Thanks Mike. I'm getting better results than before with psl=t, but I'm thinking that I should perhaps add logic to manually loop through the list of results and pick the one with the highest score, as opposed to relying on the first result returned. Where do you think would be the best place to query this further?

  51. I believe with pSL=False you'll always get the top-scoring language as the choice. Have you ever found a case where you didn't?

  52. You're right; sticking with False. One thing I am pondering is why I'm not getting a lot of Greek results coming through. If I go to Google Translate, I can type a basic English sentence, grab the Greek, and paste it into CLD, and it returns the wrong language pretty much every time. Any ideas here?

    >>> cld.detect("Σήμερα ο καιρός είναι ζεστός", pickSummaryLanguage=False, removeWeakMatches=True)
    ('RUSSIAN', 'ru', True, 22, [('RUSSIAN', 'ru', 31, 1.303780964797914)])

  53. Simon, you need to first encode that Greek string as UTF-8, e.g.:

    >>> import codecs
    >>> cld.detect(codecs.getencoder('UTF-8')(u'Σήμερα ο καιρός είναι ζεστός')[0])

    ('GREEK', 'el', True, 54, [('GREEK', 'el', 100, 54.0)])
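
    (A plain unicode literal produces the same bytes, and is a bit simpler:

    >>> cld.detect(u'Σήμερα ο καιρός είναι ζεστός'.encode('UTF-8'))

    either way works.)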

  54. Hi,
    Can someone please give me the steps to install cld under Windows 7?
    Thanks

  55. Hi Anonymous,

    At one point the build.win.cmd worked on Windows, for just the library (not the Python wrapper), but I haven't tested this in some time ...

    Replies
    1. There is an executable ready for installation at http://www.lfd.uci.edu/~gohlke/pythonlibs/#cld . I want to know which of these two libraries (langid and CLD) is better for classifying tweets

  56. I want to know which of these two libraries (langid and CLD) is better for classifying tweets

  57. Hi Anonymous,

    I don't know offhand which library is better ... you need to test for yourself, I think.

  58. Mike,

    Just wanted to say thanks for the awesome module. I was using guess-language from pip before but CLD gave me a huge boost in result accuracy!

    I can also confirm that the 0.2 version didn't work on heroku but the 0.031415 does!! So thanks again for that :)

    Cheers,

  59. Hi Sylvain,

    I'm glad you found CLD useful. Say thank you to Google :)

    That's spooky that 0.2 does NOT work but 0.031415 (the older version) does. What went wrong with 0.2?

  60. I did a comparison of CLD with our own language detection API web service (WhatLanguage.net). You can read the full comparison of WhatLanguage.net, CLD, Tika, language-detection and langid.py at http://www.whatlanguage.net/en/api/accuracy_language_detection

  61. This is really a great post, thank you Mike.

    I am trying to integrate CLD into my Java code with no luck so far. The Java wrapper here https://github.com/mzsanford/cld didn't work on 64-bit machines. Any idea if there is another wrapper?

    Best,
    Ed

    Replies
    1. Hi Anonymous,

      I don't know anything about that port ... and I don't know of any other Java ports.

      However, there is a new Java port of langid.py (https://github.com/saffsd/langid.py) at https://github.com/carrotsearch/langid-java ... I did some simple tests and it gets the same results as langid.py and is quite a bit faster.

      There is also the language detection library https://code.google.com/p/language-detection/

  62. Which one is the best for short texts?

    Replies
    1. Hi Anonymous,

      I'm really not sure ... you should go test them and then report back! But in general short text is quite a bit harder...

      Delete
  63. Hi Mike,
    Just wanted to say thanks! Definitely appreciate all the initiative you took on this project. We're finding the CLD Python binding really useful.

    We did come across a strange problem where CLD fails to detect the correct language when an '&' character is part of the text. I wondered if anybody else had encountered this (or maybe I've missed something obvious).

    >>> clean_text = "Nation & world: Russian president says Sochi will be 'fully tolerant' of gay athletes at Olympics"
    >>> clean_text_utf = clean_text.encode('utf-8', 'ignore')
    >>> cld.detect(clean_text_utf , pickSummaryLanguage=True, removeWeakMatches=True)
    ('Unknown', 'un', True, 9, [])
    >>> cld.detect(clean_text_utf)
    ('Unknown', 'un', True, 9, [])

    If we remove the '&' character, accuracy returns to normal:
    >>> clean_text = "Nation world: Russian president says Sochi will be 'fully tolerant' of gay athletes at Olympics"
    >>> clean_text_utf = clean_text.encode('utf-8', 'ignore')
    >>> cld.detect(clean_text_utf)
    ('ENGLISH', 'en', True, 95, [('ENGLISH', 'en', 100, 76.985413290113456)])

    There seem to be some other troublesome characters as well.

    I see there is a CLD2. I don't yet have a sandbox install to see if this problem remains.

    Best,

    -Pat

    Replies
    1. My guess is it's trying to parse an escaped HTML character? Are you passing isPlainText=False (this is the default)? If so, can you open an issue with the CLD2 project? A lone & should be untouched ... but maybe CLD2 is doing something silly like throwing out the rest of the input after the &.
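
      In the meantime, passing isPlainText=True should skip the HTML handling entirely -- worth a quick test with your snippet above:

      >>> cld.detect(clean_text_utf, isPlainText=True)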
