Changing Bits: Accuracy and performance of Google's Compact Language Detector

Tuesday, October 25, 2011

Accuracy and performance of Google's Compact Language Detector

To get a sense of the accuracy and performance of Google's Compact Language Detector, I ran some tests against two other packages:

Apache Tika, implemented in Java, using its LanguageIdentification class
language-detection, a project on Google code, also implemented in Java

For the test corpus I used a the corpus described here, created by the author of language-detection. It contains 1000 texts from each of 21 languages, randomly sampled from the Europarl corpus.

It's not a perfect test (no test ever is!): the content is already very clean plain text; there are no domain, language, encoding hints to apply (which you'd normally have with HTML content loaded over HTTP); it "only" covers 21 languages (versus at least 76 that CLD can detect).

CLD and language-detection cover all 21 languages, but Tika is missing Bulgarian (bg), Czech (cs), Lithuanian (lt) and Latvian (lv), so I only tested on the remaining subset of 17 languages that all three detectors support. This works out to 17,000 texts totalling 2.8 MB.

Many of the texts are very short, making the test challenging: the shortest is 25 bytes, and 290 (1.7%) of the 17000 are 30 bytes or less.

In addition to the challenges of the corpora, the differences in the detectors make the comparison somewhat apples to oranges. For example, CLD detects at least 76 languages, while language-detection detects 53 and Tika detects 27, so this biases against CLD, and language-detection to a lesser extent, since their classification task is harder relative to Tika's.

For CLD, I disabled its option to abstain (removeWeakMatches), so that it always guesses at the language even when confidence is low, to match the other two detectors. I also turned off the pickSummaryLanguage, as this was also hurting accuracy; now CLD simply picks the highest scoring match as the detected language.

For language-detection, I ran with the default ALPHA of 0.5, and set the random seed to 0.

Here are the raw results:

CLD results (total 98.82% = 16800 / 17000):

da	93.4%	da=934	nb=54	sv=5	fr=2	eu=2	is=1	hr=1	en=1
de	99.6%	de=996	en=2	ga=1	cy=1
el	100.0%	el=1000
en	100.0%	en=1000
es	98.3%	es=983	pt=4	gl=3	en=3	it=2	eu=2	id=1	fi=1	da=1
et	99.6%	et=996	ro=1	id=1	fi=1	en=1
fi	100.0%	fi=1000
fr	99.2%	fr=992	en=4	sq=2	de=1	ca=1
hu	99.9%	hu=999	it=1
it	99.5%	it=995	ro=1	mt=1	id=1	fr=1	eu=1
nl	99.5%	nl=995	af=3	sv=1	et=1
pl	99.6%	pl=996	tr=1	sw=1	nb=1	en=1
pt	98.7%	pt=987	gl=4	es=3	mt=1	it=1	is=1	ht=1	fi=1	en=1
ro	99.8%	ro=998	da=1	ca=1
sk	98.8%	sk=988	cs=9	en=2	de=1
sl	95.1%	sl=951	hr=32	sr=8	sk=5	en=2	id=1	cs=1
sv	99.0%	sv=990	nb=9	en=1

Tika results (total 97.12% = 16510 / 17000):

87.6%

da=876

no=112

nl=4

sv=3

it=1

fr=1

et=1

en=1

de=1

98.5%

de=985

nl=3

it=3

da=3

sv=2

fr=2

sl=1

ca=1

100.0%

el=1000

96.9%

en=969

no=10

it=6

ro=4

sk=3

fr=3

hu=2

et=2

sv=1

89.8%

es=898

gl=47

pt=22

ca=15

it=6

eo=4

fr=3

fi=2

sk=1

nl=1

et=1

99.1%

et=991

fi=4

fr=2

sl=1

no=1

ca=1

99.4%

fi=994

et=5

hu=1

98.0%

fr=980

sl=6

eo=3

et=2

sk=1

ro=1

no=1

it=1

gl=1

fi=1

es=1

de=1

ca=1

99.9%

hu=999

ca=1

99.4%

it=994

eo=4

pt=1

fr=1

97.8%

nl=978

no=8

de=3

da=3

sl=2

ro=2

pl=1

it=1

gl=1

et=1

99.1%

pl=991

sl=3

sk=2

ro=1

it=1

hu=1

fi=1

94.4%

pt=944

gl=48

hu=2

ca=2

it=1

et=1

es=1

en=1

99.3%

ro=993

is=2

sl=1

pl=1

it=1

hu=1

fr=1

96.2%

sk=962

sl=21

pl=13

it=2

ro=1

et=1

98.5%

sl=985

sk=7

et=4

it=2

pt=1

no=1

97.1%

sv=971

no=15

nl=6

da=6

de=1

ca=1

Language-detection results (total 99.22% = 16868 / 17000):

da	97.1%	da=971	no=28	en=1
de	99.8%	de=998	da=1	af=1
el	100.0%	el=1000
en	99.7%	en=997	nl=1	fr=1	af=1
es	99.5%	es=995	pt=4	en=1
et	99.6%	et=996	fi=2	de=1	af=1
fi	99.8%	fi=998	et=2
fr	99.8%	fr=998	sv=1	it=1
hu	99.9%	hu=999	id=1
it	99.8%	it=998	es=2
nl	97.7%	nl=977	af=21	sv=1	de=1
pl	99.9%	pl=999	nl=1
pt	99.4%	pt=994	es=3	it=1	hu=1	en=1
ro	99.9%	ro=999	fr=1
sk	98.7%	sk=987	cs=8	sl=2	ro=1	lt=1	et=1
sl	97.2%	sl=972	hr=27	en=1
sv	99.0%	sv=990	no=8	da=2

Some quick analysis:

The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the tests are!
The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and language-detection). Tika in particular has trouble with Spanish (confuses it with Galician). These confusions are to be expected: the languages are very similar.

When language-detection was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time. These numbers are quite low! It tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it's not the case that they are all always wrong on the short texts.

This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts). This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.

You could also make a simple majority-rules voting system across these (and other) libraries. I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with language-detection choice. This gives the best accuracy of all: total 99.59% (= 16930 / 17000)!

Finally, I also separately tested the run time for each package. Each time is the best of 10 runs through the full corpus:

CLD	171 msec	16.331 MB/sec
`language-detection`	2367 msec	1.180 MB/sec
Tika	42219 msec	0.066 MB/sec

CLD is incredibly fast! language-detection is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).

I used the 09-13-2011 release of language-detection, the current trunk (svn revision 1187915) of Apache Tika, and the current trunk (hg revision b0adee43f3b1) of CLD. All sources for the performance tests are available from here.

28 comments:

Otis GospodneticOctober 25, 2011 at 12:52 PM
Great information, especially the performance part! Thanks Mike!
ReplyDelete
Replies
F. Javier AlbaNovember 2, 2011 at 8:36 AM
wow! CLD is ultra-fast.

I'm gonna test it.

Thanks!
ReplyDelete
Replies
AnonymousNovember 10, 2011 at 10:39 AM
Why not developing a library just for python ?
ReplyDelete
Replies
Michael McCandlessNovember 11, 2011 at 6:45 AM
Hi Anonymous,

Here's a "pure Python" language detector: https://github.com/saffsd/langid.py

Its accuracy is very good; see: http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html?showComment=1319806660503#c569166902196965255
ReplyDelete
Replies
Cosmin PopaNovember 15, 2011 at 9:56 AM
Tanks for the reply on langid.py
Have you tested it's performance on small chunks of text, or just plain words?
I've made a few tests and the accuracy is not that great actually, do you have any idea if the module is based only on large texts or .. ?

Thanks again!

And about the CLD, i have tried to install it on Windows XP SP2 and the "build.win.cmd" gives me some errors, egg:
Could Not Find C:\chromium-compact-language-detector\libcld.lib

'cl.exe' is not recognized as an internal or external command, operable program or batch file.

'lib.exe' is not recognized as an internal or external command, operable program or batch file.

Can you help me on this problem ?
ReplyDelete
Replies
Michael McCandlessNovember 15, 2011 at 12:42 PM
Hi Cosmin,

I don't know much about langid.py (maybe try contacting its author?). It did test very well for me on the Europarl corpus though...

On your compilation errors, you need to have Visual Studio installed, and the tools have to be on your "path". Visual Studio includes a .bat file to setup your path... to verify it's working, just type "cl.exe" at the command prompt and confirm the executable is found.
ReplyDelete
Replies
Marco LuiNovember 30, 2011 at 2:16 AM
Hello Mike

I just spotted this post of yours now. Thanks for sharing this detailed comparison. Your presentation of results is very clear and concise.

@Cosmin I am the author of langid.py. I am working on a large-scale comparison of different langid tools, on a variety of datasets with different characteristics. Language identification of very short text segments is being discussed quite alot due to the popularity of services like Twitter, so I will be including tests aimed specifically at investigating performance over Twitter data. Preliminary results indicate that for most tools, performance is quite far below the long-text levels.
ReplyDelete
Replies
Daniel LindmarkJanuary 25, 2012 at 11:07 AM
Hi Mike!
Thanks for sharing this! I'm just curious to know if there is the possibility in CLD to re-train the language models/tables?

Daniel
ReplyDelete
Replies
Michael McCandlessJanuary 25, 2012 at 11:49 AM
Hi Daniel,

I imagine the Chromium team knows how to generate the tables from training data... but I sure don't know how! It's a great question.

Maybe try emailing the chromium-dev group? (http://groups.google.com/a/chromium.org/group/chromium-dev/topics )
ReplyDelete
Replies
Daniel LindmarkJanuary 26, 2012 at 4:16 AM
Ok. Thanks!
ReplyDelete
Replies
UnknownOctober 12, 2012 at 10:26 PM
Great comparison.
Very useful for me.

Thanks a lot!!!
ReplyDelete
Replies
RuudNovember 17, 2012 at 2:02 AM
I tested the Google stuff, and confusion between Afrikaans and Dutch was very bad indeed.
Language profile seem to have been made from Wikipedia, which is by design full of non-native content and proper names. Maybe that is bad training material. Results from training the Frysian profile from the Frysian wiki supports this idea. The Dutch Wikipedia is not very big, a better corpus might help.

Ik can find no info at all on how to train new languages to TIKA. Is there any?
ReplyDelete
Replies
Michael McCandlessNovember 18, 2012 at 11:34 AM
Hi Ruud,

I think Google's CLD was trained on a wider corpus than Wikipedia.

I don't know much about how Tika's language detection is trained ... if you work it out please submit a patch describing it! But I think Tika really should just use one of the existing language detectors ...
ReplyDelete
Replies
UnknownApril 21, 2013 at 7:28 AM
Great comparison, Mike. I used your results to compare the accuracy of my own language identification web service API (WhatLanguage.net). I also added langid.py in the mix, another popular library. You can find an accuracy comparison between WhatLanguage.net, CLD, Tika, language-detection and langid at http://www.whatlanguage.net/en/api/accuracy_language_detection
ReplyDelete
Replies
Michael McCandlessApril 21, 2013 at 7:57 AM
Hi Nick,

Those are impressive results! Do you describe your approach / share the source code anywhere?
ReplyDelete
Replies
AnonymousSeptember 11, 2013 at 10:04 PM
Hi Mike,
Could you please tell me which one is the best for short texts? I had a problem with language-detection for shorter messages. I heard about Twitter’s machine language detection algorithm and they say it is way accurate for the short texts.

Waiting for your suggestions (of course reply)
ReplyDelete
Replies
Junior DevFebruary 4, 2014 at 10:37 AM
Anyone know where I can find a detailed explanation of how the library works. I mean, for example, the criteria used to choose the best language, which hash function used to store n-gram, etc. It would be helpful ..
ReplyDelete
Replies
Michael McCandlessFebruary 5, 2014 at 5:47 AM
Hi Denys,

Try reading the pages at CLD2's wiki? https://code.google.com/p/cld2/w/list
ReplyDelete
Replies
Junior DevFebruary 5, 2014 at 11:54 AM
Thank you very much Michael, I'll check it.
ReplyDelete
Replies
The Red SerpentMarch 30, 2015 at 4:11 AM
I am trying to compile CLD2 in Visual Studio 2013. I am actually trying to create a .NET Wrapper for the library so I have added all the source files inside my CLR project.

Now whenever I compile I get these linking errors.

error LNK2005: "struct CLD2::CLD2TableSummary const CLD2::kCjkDeltaBi_obj" (?kCjkDeltaBi_obj@CLD2@@3UCLD2TableSummary@1@B) already defined in cld_generated_cjk_delta_bi_32.obj

These all seems to be related as I can see a relation between the 'generated' files.

Problem is I have a lot of these and I am not sure which ones I should exclude and which I should keep and use in my code.

Here is a list all the generated files that came with the CLD2 code.

cld_generated_cjk_uni_prop_80.cc
cld_generated_score_quad_octa_2.cc
cld_generated_score_quad_octa_0122.cc
cld_generated_score_quad_octa_0122_2.cc
cld_generated_score_quad_octa_1024_256.cc
cld_generated_cjk_delta_bi_4.cc
cld_generated_cjk_delta_bi_32.cc
cld2_generated_octa2_dummy.cc
cld2_generated_quad0122.cc
cld2_generated_quad0720.cc
cld2_generated_quadchrome_2.cc
cld2_generated_quadchrome_16.cc
cld2_generated_cjk_compatible.cc
cld2_generated_deltaocta0122.cc
cld2_generated_deltaocta0527.cc
cld2_generated_deltaoctachrome.cc
cld2_generated_distinctocta0122.cc
cld2_generated_distinctocta0527.cc
cld2_generated_distinctoctachrome.cc

The naming convention of these suggests that I should only be using one of each group. At least that how I think I should use it as I am not really an expert in encoding nor in how CLD2 works. And I could not find any references online explaining how to configure it.

I tried eliminating the linking errors by keeping only one of each generated group:

for example: from `cld_generated_cjk_delta_bi_4` and `cld_generated_cjk_delta_bi_32` I kept the 32 version. And so on for the rest of the files.

Now this made CLD compile yet when I tried testing it with languages I noticed that the scores were way way off and it was behaving inexplicably bad.

I am not trying to support all languages I only need to support latin languages along with hebrew, arabic, japanese and chinese.

Can someone please explain how to configure CLD2 to compile and work correctly.
ReplyDelete
Replies
YogiJanuary 7, 2016 at 4:22 AM
This is great stuff!, I want to totally adopt the voting approach to boost the accuracy of my application, however I can't find the maven package for the CLD/CLD2. Anyone knows where it is ?
ReplyDelete
Replies
Alex OttOctober 5, 2017 at 1:31 PM
btw, here are results of evaluation of language detection with fastText: http://alexott.blogspot.de/2017/10/evaluating-fasttexts-models-for.html
ReplyDelete
Replies

Add comment