Monday, October 24, 2011

Additions to Compact Language Detector API


I've made some small improvements after my quick initial port of Google's Compact Language Detection Library, starting with some helpful Python constants:

  • cld.ENCODINGS has all the encoding names recognized by CLD; if you pass the encoding hint it must be one of these.

  • cld.LANGUAGES has the list of all base languages known (but not necessarily detectable) by CLD.

  • cld.EXTERNAL_LANGUAGES has the list of external languages known (but not necessarily detectable) by CLD.

  • cld.DETECTED_LANGUAGES has the list of detectable languages.

I haven't found a reliable way to get the full list of detectable languages; for now, cld.DETECTED_LANGUAGES contains every language covered by the unit test, 75 in total, which should be a lower bound on the true count.
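As a rough sketch of how the encoding constant might be used, the helper below rejects an encoding hint that isn't in a given list of names before it ever reaches the detector. The function name check_encoding_hint and the sample values are made up for illustration, and it assumes cld.ENCODINGS is a sequence of plain name strings; in real use you would pass cld.ENCODINGS instead of the stand-in tuple.

```python
def check_encoding_hint(hint, known_encodings):
    """Return hint unchanged if it is a recognized name, else raise ValueError."""
    if hint not in known_encodings:
        raise ValueError("unrecognized encoding hint: %r" % (hint,))
    return hint


# Stand-in values for illustration only (not real CLD encoding names);
# in real use, pass cld.ENCODINGS here instead.
SAMPLE_ENCODINGS = ("ENC_A", "ENC_B")

print(check_encoding_hint("ENC_A", SAMPLE_ENCODINGS))
```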

I also exposed control over whether CLD should abstain from a given matched language when its confidence is too low, by adding a parameter removeWeakMatches (required in C, optional in Python, default False). Turn this option on if abstaining is OK in your use case, such as a browser toolbar offering to translate content. Turn it off when comparing accuracy against other language detection libraries (unless they also abstain!).

CLD also has an algorithm that tries to pick the best "summary" language, and it doesn't always just pick the highest scoring match. For example, the code has this comment:
    // If English and X, where X (not UNK) is big enough,
    // assume the English is boilerplate and return X.
See the CalcSummaryLanguage function for more details!

I found this was hurting accuracy in my testing, so I added a parameter, pickSummaryLanguage (default False), to turn this behavior on or off as well.
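Here is a minimal sketch of how the two switches might be set for the two use cases described above. It assumes the Python binding exposes a detect function that accepts removeWeakMatches and pickSummaryLanguage as keyword arguments; the wrapper names and the fake_detect stand-in are hypothetical, and fake_detect only echoes the flags it received so the example runs without CLD installed.

```python
def detect_for_benchmark(detect, utf8_bytes):
    """Never abstain and skip summary-language selection, so accuracy
    comparisons see the raw top match."""
    return detect(utf8_bytes, removeWeakMatches=False, pickSummaryLanguage=False)


def detect_for_toolbar(detect, utf8_bytes):
    """Let the detector abstain when the match is weak."""
    return detect(utf8_bytes, removeWeakMatches=True)


def fake_detect(data, removeWeakMatches=False, pickSummaryLanguage=False):
    # Hypothetical stand-in for the real detector: it reports which
    # flags it was called with instead of detecting anything.
    return {"removeWeakMatches": removeWeakMatches,
            "pickSummaryLanguage": pickSummaryLanguage}


print(detect_for_benchmark(fake_detect, b"some text"))
print(detect_for_toolbar(fake_detect, b"some text"))
```

With the real binding you would pass its detect function in place of fake_detect.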

Finally, I fixed the Python binding to release the GIL while CLD is running, so multiple threads can now run detection at the same time without needlessly blocking one another.
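To illustrate what the GIL change enables, here is a small sketch that fans detection out over a thread pool. The helper detect_all is a made-up name, and detect is whatever callable you pass in (with the real binding, its detection function); the demo uses the built-in len as a stand-in so it runs without CLD installed.

```python
from concurrent.futures import ThreadPoolExecutor


def detect_all(detect, documents, max_workers=4):
    """Run detect on every document using a pool of worker threads.

    Results come back in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(detect, documents))


# Stand-in for the real detector: just measures each document's length.
docs = [b"hello world", b"bonjour", b"hola"]
print(detect_all(len, docs))
```

Threads only help here because the native call releases the GIL; a pure-Python detector would still run one thread at a time.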
