I've made some small improvements after my quick initial port of Google's Compact Language Detection Library, starting with some helpful Python constants:
cld.ENCODINGS has all the encoding names recognized by CLD; if you pass the encoding hint it must be one of these.
cld.LANGUAGES has the list of all base languages known (but not necessarily detectable) by CLD.
cld.EXTERNAL_LANGUAGES has the list of external languages known (but not necessarily detectable) by CLD.
cld.DETECTED_LANGUAGES has the list of detectable languages.
I haven't found a reliable way to get the full list of detectable languages; for now, I've started with all languages that are covered by the unit test, total count 75, which should be a lower bound on the true count.
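As a minimal sketch of how the encodings constant might be used, the helper below checks an encoding hint before it is handed to CLD. The function name check_encoding_hint and the SAMPLE_ENCODINGS stand-in are hypothetical and not part of the binding; in real use you would pass cld.ENCODINGS as the second argument.

```python
def check_encoding_hint(hint, known_encodings):
    """Return the hint if it is a recognized encoding name, else raise.

    known_encodings is expected to be cld.ENCODINGS. It is passed in
    explicitly so this helper runs even where the cld module is not installed.
    """
    if hint is None:
        return None  # no hint supplied, nothing to validate
    if hint not in known_encodings:
        raise ValueError("unknown encoding hint: %r" % (hint,))
    return hint


# Placeholder for cld.ENCODINGS; these entries are assumptions for the demo.
SAMPLE_ENCODINGS = ["UTF8", "LATIN1"]

print(check_encoding_hint("UTF8", SAMPLE_ENCODINGS))
```

Failing early with a clear message is usually easier to debug than an error raised from inside the extension module.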
I also exposed control over whether CLD should abstain from a given matched language if the confidence is too low, by adding a parameter removeWeakMatches (required in C and optional in Python, default False). Turn this option on if abstaining is OK in your use case, such as a browser toolbar offering to translate content. Turn it off when testing accuracy vs other language detection libraries (unless they also abstain!).
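To see why abstaining matters when comparing libraries, here is a rough sketch of an accuracy tally that counts abstentions separately from wrong answers. The score function and the stand-in detector are hypothetical; it assumes you wrap each library in a callable that returns a language code, or None when the library abstains.

```python
def score(detect, samples):
    """Tally a detector's results against labeled samples.

    detect(text) returns a language code, or None when it abstains.
    samples is a list of (text, expected_code) pairs.
    Returns a (correct, wrong, abstained) tuple.
    """
    correct = wrong = abstained = 0
    for text, expected in samples:
        guess = detect(text)
        if guess is None:
            abstained += 1
        elif guess == expected:
            correct += 1
        else:
            wrong += 1
    return correct, wrong, abstained


def toy_detector(text):
    # Stand-in for a real wrapper: abstains on very short input.
    if len(text) < 5:
        return None
    return "fr" if "bonjour" in text else "en"


SAMPLES = [("bonjour tout le monde", "fr"), ("hello there", "en"), ("hi", "en")]
print(score(toy_detector, SAMPLES))
```

A detector that abstains on hard inputs can look more precise than one that always answers, so the comparison is only fair when both libraries follow the same policy.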
CLD also has an algorithm that tries to pick the best "summary" language, and it doesn't always just pick the highest scoring match. For example, the code has this comment:
// If English and X, where X (not UNK) is big enough,
// assume the English is boilerplate and return X.
See the CalcSummaryLanguage function for more details!
I found this was hurting accuracy in testing, so I added a parameter (default False) to turn this on or off as well.
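The effect of that English boilerplate rule can be illustrated with a toy sketch. This is not CLD's CalcSummaryLanguage, which has more cases; the pick_summary name, the 20 percent threshold, and the "un" code for unknown are all assumptions made for the example.

```python
def pick_summary(matches, min_percent=20):
    """Pick a summary language from (code, percent) pairs.

    matches must be sorted with the highest percent first.
    min_percent is a made-up threshold for "big enough".
    """
    if not matches:
        return "un"  # nothing matched; "un" stands for unknown here
    top_code, _ = matches[0]
    if top_code == "en" and len(matches) > 1:
        second_code, second_percent = matches[1]
        # English plus a sizable non-unknown language: treat the English
        # as boilerplate and report the other language instead.
        if second_code != "un" and second_percent >= min_percent:
            return second_code
    return top_code


print(pick_summary([("en", 60), ("fr", 35)]))
```

In this sketch a page that is 60 percent English and 35 percent French is reported as French, which shows how the summary step can differ from the top scoring match.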
Finally, I fixed the Python binding to release the GIL while CLD is running, so multiple threads can now run detection without needlessly blocking one another.
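With the GIL released during detection, a plain thread pool is enough to spread work across cores. The sketch below is one way to do that; detect_all is a hypothetical helper, and the detect argument stands for whatever callable you use to invoke CLD on a single document.

```python
from concurrent.futures import ThreadPoolExecutor


def detect_all(detect, texts, workers=4):
    """Run detect(text) over texts using a pool of threads.

    Results come back in the same order as the inputs, because
    Executor.map preserves input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(detect, texts))


# Demo with a trivial stand-in for the real detection call.
print(detect_all(len, ["one", "three", "seventeen"]))
```

Threads only help here because the extension drops the GIL while it works; a pure Python detector would still run one thread at a time.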