Eric Fischer used CLD to create the colorful Twitter language map, and since then further language maps have appeared, e.g. for New York and London. What a multi-lingual world we live in!
Then, just a few weeks ago, I received an out-of-the-blue email from Dick Sites, creator of CLD, with great news: he was finishing up version 2.0 of CLD and had already posted the source code in a new project.
So I've now reworked the Python bindings and ported the unit tests to Python (they pass!) to take advantage of the new features. It was much easier this time around since the CLD2 sources were already pulled out into their own project (thank you Dick and Google!).
There are a number of improvements over the previous version of CLD:
- Improved accuracy.
- Upgraded to Unicode 6.2 characters.
- More languages detected: 83 languages, up from 64 previously.
- A new "full language table" detector, available in Python as a separate cld2full module, that detects 161 languages. This increases the C library size from 1.8 MB (for 83 languages) to 5.5 MB (for 161 languages). Details are here.
- An option to identify which parts (byte ranges) of the text contain which language, in case the application needs to do further language-specific processing. From Python, pass the optional returnVectors=True argument to get the byte ranges, but note that this incurs additional, non-trivial CPU cost. This wiki page shows very interesting statistics on how frequently different languages appear in one page, across top web sites, showing the importance of handling multiple languages in a single text input.
- A new hintLanguageHTTPHeaders parameter, which you can pass from the Content-Language HTTP header. Also, CLD2 will spot any lang=X attribute inside the <html> tag itself (if you pass it HTML).
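Because the ranges returned with returnVectors=True are byte ranges, they index into the UTF-8 encoded bytes and not into the characters of a Python string. Here is a minimal sketch of slicing text by those ranges. It assumes each vector is a (byteOffset, byteLength, languageName, languageCode) tuple; that layout, the split_by_language helper and the sample vectors are assumptions made for illustration, and the vectors are hand-built, not real CLD2 output.

```python
def split_by_language(utf8_bytes, vectors):
    """Return (language_code, text) pairs, one per byte range."""
    chunks = []
    for offset, length, _name, code in vectors:
        # Slice the encoded bytes, not the str: offsets count bytes.
        piece = utf8_bytes[offset:offset + length]
        chunks.append((code, piece.decode("utf-8", errors="replace")))
    return chunks


english = "Good morning. "
french = "Très bien, merci."
data = (english + french).encode("utf-8")
en_len = len(english.encode("utf-8"))
fr_len = len(french.encode("utf-8"))  # one more than len(french): "è" is 2 bytes

# Hand-built stand-in for the vectors (assumed layout, not CLD2 output).
sample_vectors = (
    (0, en_len, "ENGLISH", "en"),
    (en_len, fr_len, "FRENCH", "fr"),
)

chunks = split_by_language(data, sample_vectors)
for code, text in chunks:
    print(code, repr(text))
```

Slicing the str with the same offsets would drift by one position for every multi-byte character before the range, which is why the helper works on the encoded bytes.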
The detect function returns up to 3 top detected languages. Each detected language includes the percent of the text that was detected as the language, and a confidence score. The function no longer returns a single "picked" summary language, and the pickSummaryLanguage option has been removed: this option was apparently present for internal backwards compatibility reasons and did not improve accuracy.
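As a rough sketch of consuming that return value: judging from the tuple a reader reports in the comments, the result looks like (reliable flag, bytes of text found, details), where each detail is (language name, language code, percent, score). The field names, the top_languages helper and the second sample tuple are assumptions for illustration; only the first sample is copied from a reported result.

```python
from collections import namedtuple

# Field names are assumptions based on the reported tuple shape.
Detected = namedtuple("Detected", "name code percent score")


def top_languages(result):
    """Return the detected languages, skipping 'Unknown' filler slots."""
    _is_reliable, _bytes_found, details = result[:3]
    return [Detected(*d) for d in details if d[1] != "un"]


# Result a reader reported for cld2.detect("13 Años").
reported = (False, 10, (("Unknown", "un", 0, 0.0),) * 3)

# Made-up result, for illustration only (not real CLD2 output).
made_up = (True, 40, (("ENGLISH", "en", 95, 1200.0),
                      ("FRENCH", "fr", 4, 300.0),
                      ("Unknown", "un", 0, 0.0)))

print(top_languages(reported))
for lang in top_languages(made_up):
    print(lang.code, lang.percent, lang.score)
```

Taking result[:3] keeps the helper usable when an extra element (such as the byte-range vectors) is appended to the tuple.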
Remember that the input you provide must be valid UTF-8 bytes; otherwise all sorts of things can go wrong (incorrect results, segmentation faults).
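One way to guard against malformed input is to validate the bytes before handing them to the detector. The sketch below uses only the Python standard library; ensure_valid_utf8 is a hypothetical helper name, and replacing bad sequences with U+FFFD is one possible policy (rejecting the input is another).

```python
def ensure_valid_utf8(data):
    """Return (utf8_bytes, was_valid) for a bytes input.

    Invalid sequences are replaced with U+FFFD so the result
    always decodes cleanly as UTF-8.
    """
    try:
        data.decode("utf-8")  # strict decode raises on malformed bytes
        return data, True
    except UnicodeDecodeError:
        cleaned = data.decode("utf-8", errors="replace").encode("utf-8")
        return cleaned, False


good, good_ok = ensure_valid_utf8("13 Años".encode("utf-8"))
bad, bad_ok = ensure_valid_utf8(b"abc\xff")
print(good_ok, bad_ok, bad)
```

This only checks that the bytes are well-formed UTF-8; it does not guarantee the detector will give a useful answer for them.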
To see the list of detected languages, just run python -c "import cld2; print cld2.DETECTED_LANGUAGES", or python -c "import cld2full; print cld2full.DETECTED_LANGUAGES" to see the full set of languages.
The README gives details on how to build and install CLD2.
Once again, thank you Google, and thank you Dick Sites for making this very useful library available to the world as open-source.
Hi!
Today I spent some time comparing the quality of a few Python libs for detecting the language of track names (really short text).
Unfortunately, according to my tests, CLD2 gave me really weird results. :(
Some examples:
"Meadowlake Street" is PL
cld2.detect("13 Años")
(False, 10, (('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0)))
CLD2 was not able to detect the language of "I love music" either.
Other libs are way slower than CLD2, but they gave me better results.
I'm wondering if something is wrong with my installation, or if someone else gets the same bad matches for the strings I posted here.
I ran my tests against millions of strings, not just a few.
Indeed I see the same results as you; I think CLD2 is just not designed for short text.
You could try opening an issue at https://code.google.com/p/cld2/ and see what they say?
so what is the best library for small text lang recognition?
Maybe try https://github.com/saffsd/langid.py ? But in general, working well on short text is challenging for any language detector.
Hi,
How do I retrieve the matched tokens for a Lucene query fired against the index file? I am doing this for auto-complete. Is there any alternative way to do auto-complete in Lucene?
Hi, I don't fully understand the question. Can you re-ask this, with more details, on java-user@lucene.apache.org?
Hi,
Thank you for this tutorial, it works perfectly!
One question: how do I interface with PHP?
I searched the documentation, but I cannot find anything for CLD2 …
When I run test.py with Python it's OK, but with PHP I get:
Fatal error: Class 'CLD\Detector' not found
Any ideas?
Thanks for your help.
Eric
Maybe you can try to compile compile_full.sh from "internal" folder and call it from command line using PHP? That's what I did with Perl...
(call it = call the compiled file - compact_lang_det_test_full)*
You mentioned .NET bindings. Are they available somewhere publicly? I haven't been able to find any.
Alas I can't find the .NET bindings either ... I remember seeing them at one point ...
I have created a managed library https://github.com/diadistis/cld2.net and a nuget package https://www.nuget.org/packages/CLD2.Net/ in case anyone is still interested.
This is great! Thanks for your time and effort. Any plans to update the module in PyPi?
Alas I probably won't push new releases to PyPi: no time!
If I understand it correctly, pycldmodule.cc tries to include compact_lang_det.h, which is missing from trunk/internal. Could you please fix this or suggest a workaround? Thanks
Hi Márton,
You need to install and build cld2 first, which contains compact_lang_det.h ... see the README.
I followed the README: I sourced cld2/internal/compile_libs.sh, which created cld2/internal/libcld2.so and cld2/internal/libcld2_full.so. I added cld2/internal to my LD_LIBRARY_PATH, edited setup.py and setup_full.py, and tried python setup.py build which failed:
pycldmodule.cc:22:30: fatal error: compact_lang_det.h: No such file or directory
There are files with similar names in cld2/internal e.g.
compact_lang_det.cc, compact_lang_det_hint_code.h, compact_lang_det_impl.h, compact_lang_det_hint_code.cc, compact_lang_det_impl.cc, compact_lang_det_test.cc
but no compact_lang_det.h.
Under CLD2 I see public/compact_lang_det.h ... it's checked into source control (svn)?
When I try to run "NCLD2.LanguageDetection.GetLanguageDetectionScores("你好吗");" I get the following error:
Additional information: Unable to load DLL 'Interop_x86.CLD2': The specified module could not be found. (Exception from HRESULT: 0x8007007E)
Windows 7 64bit, Visual Studio 2015, .NET v4.6.1.
Changing console application's target to 64bit, or changing .NET version doesn't help.
Do you have any ideas why it's not working?
Thanks!