Comments on "Accuracy and performance of Google's Compact Language Detector" (Changing Bits, by Michael McCandless)

Alex Ott (2017-10-05):
By the way, here are the results of an evaluation of language detection with fastText: http://alexott.blogspot.de/2017/10/evaluating-fasttexts-models-for.html

Michael McCandless (2016-01-07):
I think you need to build CLD2 from sources yourself? See https://github.com/CLD2Owners/cld2

Yogi (2016-01-07):
This is great stuff! I want to adopt the voting approach to boost the accuracy of my application, but I can't find a Maven package for CLD/CLD2. Does anyone know where it is?

Michael McCandless (2015-06-14):
Hi, I suggest you first work with the CLD2 team (https://code.google.com/p/cld2/) to figure out how to compile the libraries on Windows?

The Red Serpent (2015-03-30):
I am trying to compile CLD2 in Visual Studio 2013. I am actually trying to create a .NET wrapper for the library, so I have added all the source files to my CLR project.

Now whenever I compile, I get linking errors like this:

    error LNK2005: "struct CLD2::CLD2TableSummary const CLD2::kCjkDeltaBi_obj" (?kCjkDeltaBi_obj@CLD2@@3UCLD2TableSummary@1@B) already defined in cld_generated_cjk_delta_bi_32.obj

These all seem to be related, as I can see a relation between the "generated" files. The problem is that I have a lot of these, and I am not sure which ones I should exclude and which I should keep and use in my code.

Here is a list of all the generated files that came with the CLD2 code:

    cld_generated_cjk_uni_prop_80.cc
    cld_generated_score_quad_octa_2.cc
    cld_generated_score_quad_octa_0122.cc
    cld_generated_score_quad_octa_0122_2.cc
    cld_generated_score_quad_octa_1024_256.cc
    cld_generated_cjk_delta_bi_4.cc
    cld_generated_cjk_delta_bi_32.cc
    cld2_generated_octa2_dummy.cc
    cld2_generated_quad0122.cc
    cld2_generated_quad0720.cc
    cld2_generated_quadchrome_2.cc
    cld2_generated_quadchrome_16.cc
    cld2_generated_cjk_compatible.cc
    cld2_generated_deltaocta0122.cc
    cld2_generated_deltaocta0527.cc
    cld2_generated_deltaoctachrome.cc
    cld2_generated_distinctocta0122.cc
    cld2_generated_distinctocta0527.cc
    cld2_generated_distinctoctachrome.cc

The naming convention suggests that I should only be using one file from each group. At least, that's how I think it should work; I am not really an expert in encodings nor in how CLD2 works, and I could not find any references online explaining how to configure it.

I tried eliminating the linking errors by keeping only one of each generated group: for example, from cld_generated_cjk_delta_bi_4 and cld_generated_cjk_delta_bi_32 I kept the 32 version, and so on for the rest of the files. This made CLD compile, yet when I tested it with various languages, the scores were way off and it behaved inexplicably badly.

I am not trying to support all languages; I only need the Latin-script languages along with Hebrew, Arabic, Japanese, and Chinese. Can someone please explain how to configure CLD2 so that it compiles and works correctly?

Junior Dev (2014-02-05):
Thank you very much Michael, I'll check it.
Michael McCandless (2014-02-05):
Hi Denys,

Try reading the pages at CLD2's wiki? https://code.google.com/p/cld2/w/list

Junior Dev (2014-02-04):
Does anyone know where I can find a detailed explanation of how the library works? I mean, for example, the criteria used to choose the best language, which hash function is used to store the n-grams, etc. It would be helpful.

Anonymous (2013-09-23):
Actually, most of my short texts contain single words. I didn't have enough time to implement Tika, but the blogs and papers I read all clearly suggest Tika is an order of magnitude slower and a bit less accurate.

But again, for long texts I feel language-detection is a little better. In the end I selected language-detection for the project, because of its overall performance and accuracy!
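To illustrate the question above about how these libraries work internally: CLD2's actual implementation uses large precomputed quadgram tables with custom hashing, which isn't reproduced here. The sketch below only shows the general character n-gram idea that most of these detectors share; the training strings, profile format, and scoring rule are all made up for illustration (real detectors train on large corpora).

```python
from collections import Counter

def ngrams(text, n=4):
    """Character n-grams of the text (n=4 mirrors CLD's use of quadgrams)."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny made-up training samples, one per language -- purely illustrative.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "nl": "de snelle bruine vos springt over de luie hond en dan slaapt de hond",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und schlaeft",
}
PROFILES = {lang: ngrams(text) for lang, text in TRAINING.items()}

def detect(text):
    """Score each language by n-gram overlap with its profile; return the best."""
    grams = ngrams(text)
    scores = {}
    for lang, profile in PROFILES.items():
        total = sum(profile.values())
        # Sum the profile frequency of every n-gram seen in the input,
        # normalized by profile size so profiles of different lengths compare fairly.
        scores[lang] = sum(count * profile[g] for g, count in grams.items()) / total
    return max(scores, key=scores.get)
```

This also hints at why short texts are harder, as several comments below discuss: a single word contributes only a handful of n-grams, so the per-language scores are noisy.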
Michael McCandless (2013-09-19):
Thanks for sharing!

That's interesting that langid is so much better for short texts; what was Tika's accuracy?
Anonymous (2013-09-17):
Hi Mike,

I am glad you replied. I tested two modules: 1) language-detection, and 2) langid-java (https://github.com/carrotsearch/langid-java).

For short texts and single words, langid gives better accuracy than language-detection: langid-java was accurate for more than 80% of test cases, whereas language-detection was somewhere between 60-65% (I don't know the reason, though!). For large texts and sentences, language-detection is slightly better.

GCLD is implemented in C++; I prefer Java because of my project. Apache Tika was good but a bit slower compared to language-detection and langid-java.
Michael McCandless (2013-09-12):
Hi Anonymous,

I'm really not sure; best is to go test yourself, and then report back! But in general, short text is harder...
Anonymous (2013-09-11):
Hi Mike,

Could you please tell me which one is best for short texts? I had a problem with language-detection for shorter messages. I have heard about Twitter's machine language detection algorithm, and they say it is very accurate for short texts.

Waiting for your suggestions (and of course your reply)!
Michael McCandless (2013-04-21):
Hi Nick,

Those are impressive results! Do you describe your approach / share the source code anywhere?

Anonymous (2013-04-21):
Great comparison, Mike. I used your results to compare the accuracy of my own language identification web service API (WhatLanguage.net). I also added langid.py to the mix, another popular library. You can find an accuracy comparison between WhatLanguage.net, CLD, Tika, language-detection, and langid at http://www.whatlanguage.net/en/api/accuracy_language_detection
Michael McCandless (2012-11-18):
Hi Ruud,

I think Google's CLD was trained on a wider corpus than Wikipedia.

I don't know much about how Tika's language detection is trained... if you work it out, please submit a patch describing it! But I think Tika really should just use one of the existing language detectors...

Ruud (2012-11-17):
I tested the Google stuff, and the confusion between Afrikaans and Dutch was very bad indeed. The language profiles seem to have been made from Wikipedia, which is by design full of non-native content and proper names. Maybe that is bad training material; the results from training the Frisian profile on the Frisian wiki support this idea. The Dutch Wikipedia is not very big, so a better corpus might help.

I can find no info at all on how to train new languages for Tika. Is there any?
Anonymous (2012-10-12):
Great comparison. Very useful for me. Thanks a lot!

Daniel Lindmark (2012-01-26):
Ok. Thanks!
Michael McCandless (2012-01-25):
Hi Daniel,

I imagine the Chromium team knows how to generate the tables from training data... but I sure don't know how! It's a great question.

Maybe try emailing the chromium-dev group? (http://groups.google.com/a/chromium.org/group/chromium-dev/topics)
Daniel Lindmark (2012-01-25):
Hi Mike!

Thanks for sharing this! I'm just curious to know whether there is any way in CLD to re-train the language models/tables?

Daniel
Marco Lui (2011-11-30):
Hello Mike,

I just spotted this post of yours now. Thanks for sharing this detailed comparison; your presentation of the results is very clear and concise.

@Cosmin: I am the author of langid.py. I am working on a large-scale comparison of different langid tools on a variety of datasets with different characteristics. Language identification of very short text segments is being discussed quite a lot due to the popularity of services like Twitter, so I will be including tests aimed specifically at investigating performance on Twitter data. Preliminary results indicate that for most tools, performance is quite far below the long-text levels.
Michael McCandless (2011-11-15):
Hi Cosmin,

I don't know much about langid.py (maybe try contacting its author?). It did test very well for me on the Europarl corpus, though...

On your compilation errors: you need to have Visual Studio installed, and its tools have to be on your "path". Visual Studio includes a .bat file to set up your path... to verify it's working, just type "cl.exe" at the command prompt and confirm the executable is found.
Cosmin Popa (2011-11-15):
Thanks for the reply on langid.py. Have you tested its performance on small chunks of text, or just plain words? I've made a few tests and the accuracy is not that great, actually; do you have any idea whether the module is intended only for large texts, or...?

Thanks again!

And about CLD: I tried to install it on Windows XP SP2, and "build.win.cmd" gives me some errors, e.g.:

    Could Not Find C:\chromium-compact-language-detector\libcld.lib

    'cl.exe' is not recognized as an internal or external command, operable program or batch file.

    'lib.exe' is not recognized as an internal or external command, operable program or batch file.

Can you help me with this problem?
Michael McCandless (2011-11-11):
Hi Anonymous,

Here's a "pure Python" language detector: https://github.com/saffsd/langid.py

Its accuracy is very good; see: http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html?showComment=1319806660503#c569166902196965255
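On the "voting approach" mentioned earlier in the thread (combining several detectors and letting them vote on the answer): the post's exact combination logic isn't reproduced in these comments, but a minimal majority-vote sketch looks roughly like this. The three stand-in detector functions are hypothetical placeholders; in practice they would wrap real bindings such as pycld2, langid.py, or language-detection.

```python
from collections import Counter

def vote(text, detectors):
    """Run each detector on the text and return the majority language.

    Ties are broken in favor of the earliest detector in the list,
    so put the detector you trust most first.
    """
    votes = [d(text) for d in detectors]
    counts = Counter(votes)
    best = max(counts.values())
    for v in votes:  # first vote reaching the top count wins
        if counts[v] == best:
            return v

# Hypothetical stand-ins; replace with real detector wrappers.
detect_a = lambda text: "en"
detect_b = lambda text: "nl"
detect_c = lambda text: "en"

print(vote("some input text", [detect_a, detect_b, detect_c]))  # -> en
```

The tie-breaking rule matters: with an even number of detectors, or three detectors that all disagree, there is no majority, so ordering the detectors by measured standalone accuracy is a simple way to resolve ties.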