To get a sense of the accuracy and performance of Google's Compact
Language Detector (CLD), I ran some tests against two other packages:
Apache Tika and language-detection.
For the test corpus I used the corpus described here, created by the
author of language-detection. It contains 1000 texts from each of 21
languages, randomly sampled from the Europarl corpus.
It's not a perfect test (no test ever is!): the content is already
very clean plain text; there are no domain, language or encoding hints
to apply (which you'd normally have with HTML content loaded over
HTTP); it "only" covers 21 languages (versus at least 76 that CLD can
detect).
CLD and language-detection cover all 21 languages, but Tika is missing
Bulgarian (bg), Czech (cs), Lithuanian (lt) and Latvian (lv), so I only
tested on the remaining subset of 17 languages that all three detectors
support. This works out to 17,000 texts totalling 2.8 MB.
Many of the texts are very short, making the test challenging: the
shortest is 25 bytes, and 290 (1.7%) of the 17000 are 30 bytes or
less.
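As a rough illustration of the corpus preparation described above, here is a minimal Python sketch that restricts a corpus to the 17 shared languages and counts the very short texts. The function names and the in-memory `(language, text)` representation are assumptions made for this example, not the layout of the actual test sources.

```python
# The 21 Europarl languages in the corpus (ISO 639-1 codes).
CORPUS_LANGS = {
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
}
# Languages Tika could not detect, as listed in the text above.
TIKA_MISSING = {"bg", "cs", "lt", "lv"}
COMMON_LANGS = CORPUS_LANGS - TIKA_MISSING


def filter_common(corpus):
    """Keep only (lang, text) pairs whose language all three detectors support.

    `corpus` is a hypothetical list of (language code, text) tuples.
    """
    return [(lang, text) for lang, text in corpus if lang in COMMON_LANGS]


def short_text_stats(texts, max_bytes=30):
    """Return (shortest size in bytes, count of short texts, percent short)."""
    sizes = [len(text.encode("utf-8")) for text in texts]
    short = sum(1 for size in sizes if size <= max_bytes)
    return min(sizes), short, 100.0 * short / len(sizes)
```

Sizes are measured in UTF-8 bytes rather than characters, since the figures above are given in bytes.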
In addition to the challenges of the corpus, the differences in the
detectors make the comparison somewhat apples to oranges. For
example, CLD detects at least 76 languages, while language-detection
detects 53 and Tika detects 27, so this biases against CLD, and
language-detection to a lesser extent, since their classification task
is harder relative to Tika's.
For CLD, I disabled its option to abstain (removeWeakMatches), so that
it always guesses at the language even when confidence is low, to match
the other two detectors. I also turned off the pickSummaryLanguage
option, as this was also hurting accuracy; now CLD simply picks the
highest scoring match as the detected language.
For language-detection, I ran with the default ALPHA of 0.5, and set
the random seed to 0.
Here are the raw results:
CLD results (total 98.82% = 16800 / 17000):
da | 93.4% | da=934 | nb=54 | sv=5 | fr=2 | eu=2 | is=1 | hr=1 | en=1 |
de | 99.6% | de=996 | en=2 | ga=1 | cy=1 |
el | 100.0% | el=1000 |
en | 100.0% | en=1000 |
es | 98.3% | es=983 | pt=4 | gl=3 | en=3 | it=2 | eu=2 | id=1 | fi=1 | da=1 |
et | 99.6% | et=996 | ro=1 | id=1 | fi=1 | en=1 |
fi | 100.0% | fi=1000 |
fr | 99.2% | fr=992 | en=4 | sq=2 | de=1 | ca=1 |
hu | 99.9% | hu=999 | it=1 |
it | 99.5% | it=995 | ro=1 | mt=1 | id=1 | fr=1 | eu=1 |
nl | 99.5% | nl=995 | af=3 | sv=1 | et=1 |
pl | 99.6% | pl=996 | tr=1 | sw=1 | nb=1 | en=1 |
pt | 98.7% | pt=987 | gl=4 | es=3 | mt=1 | it=1 | is=1 | ht=1 | fi=1 | en=1 |
ro | 99.8% | ro=998 | da=1 | ca=1 |
sk | 98.8% | sk=988 | cs=9 | en=2 | de=1 |
sl | 95.1% | sl=951 | hr=32 | sr=8 | sk=5 | en=2 | id=1 | cs=1 |
sv | 99.0% | sv=990 | nb=9 | en=1 |
Tika results (total 97.12% = 16510 / 17000):
da | 87.6% | da=876 | no=112 | nl=4 | sv=3 | it=1 | fr=1 | et=1 | en=1 | de=1 |
de | 98.5% | de=985 | nl=3 | it=3 | da=3 | sv=2 | fr=2 | sl=1 | ca=1 |
el | 100.0% | el=1000 |
en | 96.9% | en=969 | no=10 | it=6 | ro=4 | sk=3 | fr=3 | hu=2 | et=2 | sv=1 |
es | 89.8% | es=898 | gl=47 | pt=22 | ca=15 | it=6 | eo=4 | fr=3 | fi=2 | sk=1 | nl=1 | et=1 |
et | 99.1% | et=991 | fi=4 | fr=2 | sl=1 | no=1 | ca=1 |
fi | 99.4% | fi=994 | et=5 | hu=1 |
fr | 98.0% | fr=980 | sl=6 | eo=3 | et=2 | sk=1 | ro=1 | no=1 | it=1 | gl=1 | fi=1 | es=1 | de=1 | ca=1 |
hu | 99.9% | hu=999 | ca=1 |
it | 99.4% | it=994 | eo=4 | pt=1 | fr=1 |
nl | 97.8% | nl=978 | no=8 | de=3 | da=3 | sl=2 | ro=2 | pl=1 | it=1 | gl=1 | et=1 |
pl | 99.1% | pl=991 | sl=3 | sk=2 | ro=1 | it=1 | hu=1 | fi=1 |
pt | 94.4% | pt=944 | gl=48 | hu=2 | ca=2 | it=1 | et=1 | es=1 | en=1 |
ro | 99.3% | ro=993 | is=2 | sl=1 | pl=1 | it=1 | hu=1 | fr=1 |
sk | 96.2% | sk=962 | sl=21 | pl=13 | it=2 | ro=1 | et=1 |
sl | 98.5% | sl=985 | sk=7 | et=4 | it=2 | pt=1 | no=1 |
sv | 97.1% | sv=971 | no=15 | nl=6 | da=6 | de=1 | ca=1 |
Language-detection results (total 99.22% = 16868 / 17000):
da | 97.1% | da=971 | no=28 | en=1 |
de | 99.8% | de=998 | da=1 | af=1 |
el | 100.0% | el=1000 |
en | 99.7% | en=997 | nl=1 | fr=1 | af=1 |
es | 99.5% | es=995 | pt=4 | en=1 |
et | 99.6% | et=996 | fi=2 | de=1 | af=1 |
fi | 99.8% | fi=998 | et=2 |
fr | 99.8% | fr=998 | sv=1 | it=1 |
hu | 99.9% | hu=999 | id=1 |
it | 99.8% | it=998 | es=2 |
nl | 97.7% | nl=977 | af=21 | sv=1 | de=1 |
pl | 99.9% | pl=999 | nl=1 |
pt | 99.4% | pt=994 | es=3 | it=1 | hu=1 | en=1 |
ro | 99.9% | ro=999 | fr=1 |
sk | 98.7% | sk=987 | cs=8 | sl=2 | ro=1 | lt=1 | et=1 |
sl | 97.2% | sl=972 | hr=27 | en=1 |
sv | 99.0% | sv=990 | no=8 | da=2 |
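Each row in the results above is a per-language confusion tally: the expected language, the percent detected correctly, then a count for every language the detector actually returned. A minimal Python sketch of how such rows could be computed follows; the function names and the parallel-list inputs are assumptions for illustration, not the code used to produce these numbers.

```python
from collections import Counter, defaultdict


def confusion_rows(expected, detected):
    """Map each expected language to a Counter of the languages detected for it.

    `expected` and `detected` are parallel lists of language codes.
    """
    rows = defaultdict(Counter)
    for exp, det in zip(expected, detected):
        rows[exp][det] += 1
    return rows


def format_row(lang, counts):
    """Render one row: language, percent correct, then counts, largest first."""
    total = sum(counts.values())
    # Sort by count, breaking ties by language code, both descending.
    cells = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    parts = [lang, "%.1f%%" % (100.0 * counts[lang] / total)]
    parts += ["%s=%d" % kv for kv in cells]
    return " | ".join(parts) + " |"


def overall_accuracy(rows):
    """Fraction of all texts whose detected language equals the expected one."""
    correct = sum(counts[lang] for lang, counts in rows.items())
    total = sum(sum(counts.values()) for counts in rows.values())
    return correct / total
```

For example, the CLD total works out as 16800 correct out of 17000 texts, or 98.82%.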
Some quick analysis:
- The language-detection library gets the best accuracy, at 99.22%,
followed by CLD, at 98.82%, followed by Tika at 97.12%.
Overall these accuracies are very good, especially considering how
short some of the texts are!
- The difficult languages are Danish (confused with Norwegian),
Slovene (confused with Croatian) and Dutch (for Tika and
language-detection). Tika in particular has trouble with Spanish
(confuses it with Galician). These confusions are to be expected: the
languages are very similar.
When language-detection was wrong, Tika was also wrong 37% of the time
and CLD was also wrong 23% of the time. These numbers are quite low!
They tell us that the errors are somewhat orthogonal, i.e. the
libraries tend to get different test cases wrong. For example, it's not
the case that they are all always wrong on the short texts.
This means the libraries are using different overall signals to
achieve their classification (for example, perhaps they were trained
on different training texts). This is encouraging since it means, in
theory, one could build a language detection library combining the
signals of all of these libraries and achieve better overall accuracy.
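The overlap figures quoted above are conditional error rates: of the texts one detector got wrong, what fraction did another detector also get wrong? A small Python sketch of that calculation, assuming three parallel lists of language codes (the function name and inputs are hypothetical, not taken from the test sources):

```python
def also_wrong_rate(expected, reference, other):
    """Of the texts `reference` got wrong, return the fraction `other` also got wrong.

    All three arguments are parallel lists of language codes.
    Returns 0.0 if `reference` made no errors at all.
    """
    ref_wrong = [
        i for i, (exp, ref) in enumerate(zip(expected, reference)) if ref != exp
    ]
    if not ref_wrong:
        return 0.0
    both_wrong = sum(1 for i in ref_wrong if other[i] != expected[i])
    return both_wrong / len(ref_wrong)
```

A rate near 1.0 would mean the detectors fail on the same texts; the lower the rate, the more a combination of detectors has to gain.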
You could also make a simple majority-rules voting system across these
(and other) libraries. I tried exactly that approach: if any language
receives 2 or more votes from the three detectors, select that as the
detected language; otherwise, go with language-detection's choice. This
gives the best accuracy of all: total 99.59% (= 16930 / 17000)!
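The voting rule above is simple enough to sketch in a few lines of Python. This is an illustration of the rule as described, not the code used for the test. The `ALIASES` table is my own assumption: the results show CLD reporting Norwegian as nb while the other two detectors report no, so some normalization seems necessary before votes can agree, but whether the original test did this is not stated.

```python
from collections import Counter

# Assumption: map detector-specific codes onto one shared code so votes can match.
ALIASES = {"nb": "no"}


def normalize(lang):
    """Return the shared code for a detector-specific language code."""
    return ALIASES.get(lang, lang)


def vote(cld, tika, langdetect):
    """Pick any language with 2+ votes; otherwise fall back to language-detection."""
    votes = [normalize(cld), normalize(tika), normalize(langdetect)]
    lang, count = Counter(votes).most_common(1)[0]
    return lang if count >= 2 else votes[2]
```

For example, `vote("da", "no", "da")` selects `da` by majority, while three different answers fall through to whatever language-detection returned.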
Finally, I also separately tested the run time for each package. Each
time is the best of 10 runs through the full corpus:
CLD | 171 msec | 16.331 MB/sec |
language-detection | 2367 msec | 1.180 MB/sec |
Tika | 42219 msec | 0.066 MB/sec |
CLD is incredibly fast! language-detection is an order of magnitude
slower, and Tika is another order of magnitude slower (not sure why).
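For reference, a best-of-N timing loop of the kind described above might look like the following Python sketch. The `detect` callable is a stand-in for whichever detector is being measured, and treating 1 MB as 1024 * 1024 bytes is an assumption; neither detail comes from the actual test sources.

```python
import time


def best_of(detect, texts, runs=10):
    """Run `detect` over every text `runs` times; return the fastest pass in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        for text in texts:
            detect(text)
        best = min(best, time.perf_counter() - start)
    return best


def throughput_mb_per_sec(total_bytes, seconds):
    """Convert a byte count and elapsed time into MB/sec (1 MB = 1024 * 1024 bytes)."""
    return total_bytes / (1024.0 * 1024.0) / seconds
```

Taking the best run rather than the average reduces noise from warm-up and other activity on the machine.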
I used the 09-13-2011 release of language-detection, the current trunk
(svn revision 1187915) of Apache Tika, and the current trunk (hg
revision b0adee43f3b1) of CLD.
All sources for the performance tests are available from here.