Changing Bits: Compact Language Detector


Friday, August 2, 2013

A new version of the Compact Language Detector

It's been almost two years since I originally factored out the fast and accurate Compact Language Detector from the Chromium project, and the effort was clearly worthwhile: the project is popular and others have created additional bindings for languages including at least Perl, Ruby, R, JavaScript, PHP and C#/.NET.

Eric Fischer used CLD to create the colorful Twitter language map, and since then further language maps have appeared, e.g. for New York and London. What a multi-lingual world we live in!

Suddenly, just a few weeks ago, I received an out-of-the-blue email from Dick Sites, creator of CLD, with great news: he was finishing up version 2.0 of CLD and had already posted the source code on a new project.

So I've now reworked the Python bindings and ported the unit tests to Python (they pass!) to take advantage of the new features. It was much easier this time around since the CLD2 sources were already pulled out into their own project (thank you Dick and Google!).

There are a number of improvements over the previous version of CLD:

Improved accuracy.
Upgraded to Unicode 6.2 characters.
More languages detected: 83 languages, up from 64 previously.
A new "full language table" detector, available in Python as a separate cld2full module, that detects 161 languages. This increases the C library size from 1.8 MB (for 83 languages) to 5.5 MB (for 161 languages). Details are here.
An option to identify which parts (byte ranges) of the text contain which language, in case the application needs to do further language-specific processing. From Python, pass the optional returnVectors=True argument to get the byte ranges, but note that this incurs a non-trivial additional CPU cost. This wiki page shows very interesting statistics on how frequently different languages appear in one page, across top web sites, showing the importance of handling multiple languages in a single text input.
A new hintLanguageHTTPHeaders parameter, which you can pass from the Content-Language HTTP header. Also, CLD2 will spot any lang=X attribute inside the <html> tag itself (if you pass it HTML).
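
As a rough sketch of how the byte ranges could be used, the snippet below slices the original UTF-8 bytes by range and decodes each piece. It assumes each vector looks like (byte offset, byte length, language name, language code); that layout, the split_by_language helper and the sample vectors are illustrative assumptions, not output captured from CLD2.

```python
# Assumed vector layout: (byte_offset, byte_length, language_name, language_code).
# The sample vectors below are hand-written for illustration only.

def split_by_language(utf8_bytes, vectors):
    """Return a list of (language_code, text) pairs, one per byte range."""
    pieces = []
    for offset, length, _name, code in vectors:
        # Offsets are in bytes, not characters, so slice before decoding.
        chunk = utf8_bytes[offset:offset + length]
        pieces.append((code, chunk.decode("utf-8")))
    return pieces

french = "Le café est prêt. ".encode("utf-8")
english = "The coffee is ready.".encode("utf-8")
data = french + english

sample_vectors = (
    (0, len(french), "FRENCH", "fr"),
    (len(french), len(english), "ENGLISH", "en"),
)

pieces = split_by_language(data, sample_vectors)
for code, text in pieces:
    print(code, repr(text))
```

The accented characters in the French part take two bytes each, which is why the slicing has to happen on the bytes rather than on a decoded string.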

In the new Python bindings, I've exposed CLD2's debug* flags, to add verbosity to CLD2's detection process. This document describes how to interpret the resulting output.

The detect function returns up to 3 top detected languages. Each detected language includes the percent of the text that was detected as the language, and a confidence score. The function no longer returns a single "picked" summary language, and the pickSummaryLanguage option has been removed: this option was apparently present for internal backwards compatibility reasons and did not improve accuracy.
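
Since there is no longer a single "picked" language, an application that wants one has to choose it itself. Here is a minimal sketch of one way to do that, assuming each detected language arrives as a (name, code, percent, score) tuple and that unknown text is reported with the code "un". Both of those details, the pick_top_language helper and the sample data are assumptions made for illustration.

```python
def pick_top_language(details):
    """Return the code of the language covering the most text, or None.

    details: iterable of (name, code, percent, score) tuples (assumed layout).
    """
    # Ignore placeholder entries for unknown text ("un" is an assumed code).
    known = [d for d in details if d[1] != "un"]
    if not known:
        return None
    # Choose the entry with the highest percent of text.
    return max(known, key=lambda d: d[2])[1]

# Hand-written sample, not real detector output.
sample_details = (
    ("FRENCH", "fr", 64, 1100.0),
    ("ENGLISH", "en", 35, 900.0),
    ("Unknown", "un", 0, 0.0),
)

print(pick_top_language(sample_details))
```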

Remember that the provided input must be valid UTF-8 bytes, otherwise all sorts of things could go wrong (wrong results, segmentation fault).
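
One cheap safeguard is to check that the bytes decode as UTF-8 before handing them to the detector. This is only a sketch: the is_valid_utf8 helper is hypothetical, and it checks well-formedness only, which may not cover everything CLD2 expects of its input.

```python
def is_valid_utf8(data):
    """Return True if data (a bytes object) is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

good = "café".encode("utf-8")
bad = b"caf\xe9"  # Latin-1 bytes: 0xE9 starts a multi-byte sequence that never completes

print(is_valid_utf8(good), is_valid_utf8(bad))
```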

To see the list of detected languages, run:

    python -c "import cld2; print(cld2.DETECTED_LANGUAGES)"

or, to see the full set of languages:

    python -c "import cld2full; print(cld2full.DETECTED_LANGUAGES)"

The README gives details on how to build and install CLD2.

Once again, thank you Google, and thank you Dick Sites for making this very useful library available to the world as open-source.

Tuesday, October 25, 2011

Accuracy and performance of Google's Compact Language Detector

To get a sense of the accuracy and performance of Google's Compact Language Detector, I ran some tests against two other packages:

Apache Tika, implemented in Java, using its LanguageIdentification class
language-detection, a project on Google code, also implemented in Java

For the test corpus I used the corpus described here, created by the author of language-detection. It contains 1000 texts from each of 21 languages, randomly sampled from the Europarl corpus.

It's not a perfect test (no test ever is!): the content is already very clean plain text; there are no domain, language, encoding hints to apply (which you'd normally have with HTML content loaded over HTTP); it "only" covers 21 languages (versus at least 76 that CLD can detect).

CLD and language-detection cover all 21 languages, but Tika is missing Bulgarian (bg), Czech (cs), Lithuanian (lt) and Latvian (lv), so I only tested on the remaining subset of 17 languages that all three detectors support. This works out to 17,000 texts totalling 2.8 MB.

Many of the texts are very short, making the test challenging: the shortest is 25 bytes, and 290 (1.7%) of the 17000 are 30 bytes or less.

In addition to the challenges of the corpora, the differences in the detectors make the comparison somewhat apples to oranges. For example, CLD detects at least 76 languages, while language-detection detects 53 and Tika detects 27, so this biases against CLD, and language-detection to a lesser extent, since their classification task is harder relative to Tika's.

For CLD, I disabled its option to abstain (removeWeakMatches), so that it always guesses at the language even when confidence is low, to match the other two detectors. I also turned off the pickSummaryLanguage, as this was also hurting accuracy; now CLD simply picks the highest scoring match as the detected language.

For language-detection, I ran with the default ALPHA of 0.5, and set the random seed to 0.

Here are the raw results:

CLD results (total 98.82% = 16800 / 17000):

da	93.4%	da=934	nb=54	sv=5	fr=2	eu=2	is=1	hr=1	en=1
de	99.6%	de=996	en=2	ga=1	cy=1
el	100.0%	el=1000
en	100.0%	en=1000
es	98.3%	es=983	pt=4	gl=3	en=3	it=2	eu=2	id=1	fi=1	da=1
et	99.6%	et=996	ro=1	id=1	fi=1	en=1
fi	100.0%	fi=1000
fr	99.2%	fr=992	en=4	sq=2	de=1	ca=1
hu	99.9%	hu=999	it=1
it	99.5%	it=995	ro=1	mt=1	id=1	fr=1	eu=1
nl	99.5%	nl=995	af=3	sv=1	et=1
pl	99.6%	pl=996	tr=1	sw=1	nb=1	en=1
pt	98.7%	pt=987	gl=4	es=3	mt=1	it=1	is=1	ht=1	fi=1	en=1
ro	99.8%	ro=998	da=1	ca=1
sk	98.8%	sk=988	cs=9	en=2	de=1
sl	95.1%	sl=951	hr=32	sr=8	sk=5	en=2	id=1	cs=1
sv	99.0%	sv=990	nb=9	en=1

Tika results (total 97.12% = 16510 / 17000):

da	87.6%	da=876	no=112	nl=4	sv=3	it=1	fr=1	et=1	en=1	de=1
de	98.5%	de=985	nl=3	it=3	da=3	sv=2	fr=2	sl=1	ca=1
el	100.0%	el=1000
en	96.9%	en=969	no=10	it=6	ro=4	sk=3	fr=3	hu=2	et=2	sv=1
es	89.8%	es=898	gl=47	pt=22	ca=15	it=6	eo=4	fr=3	fi=2	sk=1	nl=1	et=1
et	99.1%	et=991	fi=4	fr=2	sl=1	no=1	ca=1
fi	99.4%	fi=994	et=5	hu=1
fr	98.0%	fr=980	sl=6	eo=3	et=2	sk=1	ro=1	no=1	it=1	gl=1	fi=1	es=1	de=1	ca=1
hu	99.9%	hu=999	ca=1
it	99.4%	it=994	eo=4	pt=1	fr=1
nl	97.8%	nl=978	no=8	de=3	da=3	sl=2	ro=2	pl=1	it=1	gl=1	et=1
pl	99.1%	pl=991	sl=3	sk=2	ro=1	it=1	hu=1	fi=1
pt	94.4%	pt=944	gl=48	hu=2	ca=2	it=1	et=1	es=1	en=1
ro	99.3%	ro=993	is=2	sl=1	pl=1	it=1	hu=1	fr=1
sk	96.2%	sk=962	sl=21	pl=13	it=2	ro=1	et=1
sl	98.5%	sl=985	sk=7	et=4	it=2	pt=1	no=1
sv	97.1%	sv=971	no=15	nl=6	da=6	de=1	ca=1

Language-detection results (total 99.22% = 16868 / 17000):

da	97.1%	da=971	no=28	en=1
de	99.8%	de=998	da=1	af=1
el	100.0%	el=1000
en	99.7%	en=997	nl=1	fr=1	af=1
es	99.5%	es=995	pt=4	en=1
et	99.6%	et=996	fi=2	de=1	af=1
fi	99.8%	fi=998	et=2
fr	99.8%	fr=998	sv=1	it=1
hu	99.9%	hu=999	id=1
it	99.8%	it=998	es=2
nl	97.7%	nl=977	af=21	sv=1	de=1
pl	99.9%	pl=999	nl=1
pt	99.4%	pt=994	es=3	it=1	hu=1	en=1
ro	99.9%	ro=999	fr=1
sk	98.7%	sk=987	cs=8	sl=2	ro=1	lt=1	et=1
sl	97.2%	sl=972	hr=27	en=1
sv	99.0%	sv=990	no=8	da=2
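
For anyone who wants to work with these tables programmatically, here is a small sketch that parses rows in the layout used above (expected language, printed percentage, then detected=count pairs) and recomputes the accuracy and the most common confusion. The three rows are copied from the language-detection results; the parse_row and summarize helpers are illustrative names, not part of any of the tested libraries.

```python
ROWS = """\
da 97.1% da=971 no=28 en=1
nl 97.7% nl=977 af=21 sv=1 de=1
sl 97.2% sl=972 hr=27 en=1
"""

def parse_row(line):
    """Split one result row into (expected_language, {detected: count})."""
    fields = line.split()  # handles tabs or spaces
    expected = fields[0]
    # fields[1] is the printed percentage; it is recomputed from the counts instead.
    counts = {}
    for item in fields[2:]:
        lang, n = item.split("=")
        counts[lang] = int(n)
    return expected, counts

def summarize(line):
    """Return (expected, accuracy_percent, most_common_wrong_language)."""
    expected, counts = parse_row(line)
    total = sum(counts.values())
    accuracy = 100.0 * counts.get(expected, 0) / total
    wrong = {lang: n for lang, n in counts.items() if lang != expected}
    worst = max(wrong, key=wrong.get) if wrong else None
    return expected, accuracy, worst

for row in ROWS.splitlines():
    expected, accuracy, worst = summarize(row)
    print("%s: %.1f%% correct, most often confused with %s" % (expected, accuracy, worst))
```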

Some quick analysis:

The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%. Net/net these accuracies are very good, especially considering how short some of the texts are!
The difficult languages are Danish (confused with Norwegian), Slovene (confused with Croatian) and Dutch (for Tika and language-detection). Tika in particular has trouble with Spanish (confuses it with Galician). These confusions are to be expected: the languages are very similar.

When language-detection was wrong, Tika was also wrong 37% of the time and CLD was also wrong 23% of the time. These numbers are quite low! It tells us that the errors are somewhat orthogonal, i.e. the libraries tend to get different test cases wrong. For example, it's not the case that they are all always wrong on the short texts.

This means the libraries are using different overall signals to achieve their classification (for example, perhaps they were trained on different training texts). This is encouraging since it means, in theory, one could build a language detection library combining the signals of all of these libraries and achieve better overall accuracy.
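
The overlap figures quoted above are conditional error rates: of the texts one detector got wrong, what fraction did another detector also get wrong? A minimal sketch of that computation follows. The error_overlap helper and the five-item toy data are made up for illustration and are not the test corpus.

```python
def error_overlap(gold, preds_a, preds_b):
    """Fraction of detector A's errors on which detector B is also wrong."""
    a_wrong = [i for i, (g, a) in enumerate(zip(gold, preds_a)) if a != g]
    if not a_wrong:
        return 0.0
    both_wrong = sum(1 for i in a_wrong if preds_b[i] != gold[i])
    return both_wrong / len(a_wrong)

# Toy data only: A is wrong on items 0, 2 and 3; B is wrong on items 0 and 4.
gold    = ["da", "da", "sl", "nl", "es"]
preds_a = ["no", "da", "hr", "af", "es"]
preds_b = ["no", "da", "sl", "nl", "gl"]

print("%.0f%%" % (100 * error_overlap(gold, preds_a, preds_b)))
```

A low value means the two detectors tend to fail on different inputs, which is the situation where combining them can help.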

You could also make a simple majority-rules voting system across these (and other) libraries. I tried exactly that approach: if any language receives 2 or more votes from the three detectors, select that as the detected language; otherwise, go with language-detection's choice. This gives the best accuracy of all: total 99.59% (= 16930 / 17000)!
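
The voting rule is simple enough to sketch in a few lines. The vote function below is an illustrative reconstruction of the rule as described, not the actual test code, and it assumes the three detectors' language codes have already been normalized to a common scheme (in the results above, CLD reports Norwegian as nb while the other two report no).

```python
from collections import Counter

def vote(cld_lang, tika_lang, langdetect_lang):
    """Majority rules: 2 or more matching votes win, else trust language-detection."""
    counts = Counter([cld_lang, tika_lang, langdetect_lang])
    lang, votes = counts.most_common(1)[0]
    if votes >= 2:
        return lang
    return langdetect_lang

print(vote("da", "da", "no"))  # two votes for da
print(vote("es", "gl", "gl"))  # two votes for gl
print(vote("sl", "hr", "sk"))  # no majority, falls back to the third argument
```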

Finally, I also separately tested the run time for each package. Each time is the best of 10 runs through the full corpus:

CLD	171 msec	16.331 MB/sec
language-detection	2367 msec	1.180 MB/sec
Tika	42219 msec	0.066 MB/sec

CLD is incredibly fast! language-detection is an order of magnitude slower, and Tika is another order of magnitude slower (not sure why).
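
For reference, a best-of-N timing harness along these lines can be sketched as follows. This is not the harness used for the numbers above: best_of, mb_per_sec and the stand-in fake_detect_all workload are hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption.

```python
import time

BYTES_PER_MB = 1000 * 1000  # assumption: which MB definition to use is a choice

def best_of(runs, func, *args):
    """Call func(*args) `runs` times; return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best

def mb_per_sec(total_bytes, seconds):
    """Throughput in MB/sec for total_bytes processed in the given time."""
    return total_bytes / BYTES_PER_MB / seconds

def fake_detect_all(texts):
    """Stand-in workload; a real run would call the detector on each text."""
    return [len(t) for t in texts]

texts = [b"hello world"] * 1000
total_bytes = sum(len(t) for t in texts)
seconds = best_of(10, fake_detect_all, texts)
if seconds > 0:
    print("%.3f MB/sec" % mb_per_sec(total_bytes, seconds))
```

As a sanity check on the table, 16.331 MB/sec times 0.171 sec is about 2.79 MB, which agrees with the 2.8 MB corpus size.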

I used the 09-13-2011 release of language-detection, the current trunk (svn revision 1187915) of Apache Tika, and the current trunk (hg revision b0adee43f3b1) of CLD. All sources for the performance tests are available from here.

Monday, October 24, 2011

Additions to Compact Language Detector API

I've made some small improvements after my quick initial port of Google's Compact Language Detection Library, starting with some helpful Python constants:

cld.ENCODINGS has all the encoding names recognized by CLD; if you pass the encoding hint it must be one of these.
cld.LANGUAGES has the list of all base languages known (but not necessarily detectable) by CLD.
cld.EXTERNAL_LANGUAGES has the list of external languages known (but not necessarily detectable) by CLD.
cld.DETECTED_LANGUAGES has the list of detectable languages.

I haven't found a reliable way to get the full list of detectable languages; for now, I've started with all languages that are covered by the unit test, total count 75, which should be a lower bound on the true count.

I also exposed control over whether CLD should abstain from a given matched language if the confidence is too low, by adding a parameter removeWeakMatches (required in C and optional in Python, default False). Turn this option on if abstaining is OK in your use case, such as a browser toolbar offering to translate content. Turn it off when testing accuracy vs other language detection libraries (unless they also abstain!).
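
To make the trade-off concrete, here is a toy sketch of what abstaining means for a caller. It does not reproduce CLD's actual rule: the detect_or_abstain function, the (code, score) candidate layout and the min_score threshold are all hypothetical.

```python
def detect_or_abstain(candidates, min_score, remove_weak_matches=False):
    """Return the best language code, or None when abstaining.

    candidates: list of (language_code, score) pairs (hypothetical layout).
    """
    if not candidates:
        return None
    code, score = max(candidates, key=lambda c: c[1])
    if remove_weak_matches and score < min_score:
        return None  # abstain rather than return a low-confidence guess
    return code

# Made-up scores for a short, ambiguous text.
candidates = [("da", 40.0), ("nb", 35.0)]

print(detect_or_abstain(candidates, min_score=50.0))
print(detect_or_abstain(candidates, min_score=50.0, remove_weak_matches=True))
```

With abstaining off, every text gets a guess, which is what an accuracy comparison against detectors that always answer needs.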

CLD also has an algorithm that tries to pick the best "summary" language, and it doesn't always just pick the highest scoring match. For example, the code has this comment:

    // If English and X, where X (not UNK) is big enough,
    // assume the English is boilerplate and return X.

See the CalcSummaryLanguage function for more details!

I found this was hurting accuracy in testing so I added a parameter pickSummaryLanguage (default False) to also turn this on or off.

Finally, I fixed the Python binding to release the GIL while CLD is running, so multiple threads can now detect without falsely blocking one another.

Friday, October 21, 2011

Language detection with Google's Compact Language Detector

Google's Chrome browser has a useful translate feature, where it detects the language of the page you've visited and if it differs from your local language, it offers to translate it.

Wonderfully, Google has open-sourced most of Chrome's source code, including the embedded CLD (Compact Language Detector) library that's used to detect the language of any UTF-8 encoded content. It looks like CLD was extracted from the language detection library used in Google's toolbar.

It turns out the CLD part of the Chromium source tree is nicely standalone, so I pulled it out into a new separate Google code project, making it possible to use CLD directly from any C++ code.

I also added a basic initial Python binding (one method!), and ported the small C++ unit test (verifying detection of known strings for 64 different languages) to Python (it passes!).

So detecting language is now very simple from Python:

    import cld
    topLanguageName = cld.detect(bytes)[0]

The detect method returns a tuple, including the language name and code (such as RUSSIAN, ru), an isReliable boolean (True if CLD is quite sure of itself), the number of actual text bytes processed, and then details for each of the top languages (up to 3) that were identified.

You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out beforehand.

You can also optionally provide hints to the detect method, including the declared encoding and language (for example, from an HTTP header or an embedded META http-equiv tag in the HTML), as well as the domain name suffix (so the top level domain suffix es would boost the chances for detecting Spanish). CLD uses these hints to boost the priors for certain languages. There is this fun comment in the code in front of the tables holding the per-language prior boosts:

    Generated by dsites 2008.07.07 from 10% of Base

How I wish I too could build tables off of 10% of Base!

The code itself looks very cool and I suspect (but haven't formally verified!) it's quite accurate. I only understand bits and pieces about how it works; you can read some details here and here.

It's also not clear just how many languages it can detect; I see there are 161 "base" languages plus 44 "extended" languages, but then I see many test cases (102 out of 166!) commented out. This was likely done to reduce the size of the ngram tables; possibly Google could provide the full original set of tables for users wanting to spend more RAM in exchange for detecting the long tail.

This port is all still very new, and I extracted CLD quickly, so likely there are some problems still to work out, but the fact that it passes the Python unit test is encouraging. The README.txt has some more details.

Thank you Google!