The book describes search techniques, including tokenization, indexing, suggest and spell correction. It also covers fuzzy string matching, named entity extraction (people, places, things), clustering, classification, tagging, and a question answering system (think Jeopardy). These topics are challenging!
N-gram processing (both character and word n-grams) is featured prominently, which makes sense as it is a surprisingly effective technique for a number of applications. The book includes helpful real-world code samples showing how to process text using modern open-source tools including OpenNLP, Tika, Lucene, Solr and Mahout.
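To make the n-gram idea concrete, here is a minimal sketch in plain Python. It is not taken from the book (whose samples use the tools listed above); the function names `char_ngrams` and `word_ngrams` are my own, and the word version assumes simple whitespace tokenization.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of text, in order."""
    # A string of length L has L - n + 1 n-grams (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams of text as tuples.

    Assumption: tokens are separated by whitespace; a real analyzer
    would handle punctuation, case and so on.
    """
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("text", 2))
    print(word_ngrams("taming text is fun", 2))
```

Character n-grams like these are what make techniques such as spell correction and fuzzy matching tolerant of small typos, since two similar strings still share most of their n-grams.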
The final chapter, "Untamed Text", is especially fun: the sections, some of which are contributed by additional authors, address very challenging topics like semantics extraction, document summarization, relationship extraction, identifying important content and people, detecting emotions with sentiment analysis and cross-language information retrieval.
There were a few topics I expected to see that seemed to be missing. There was no coverage of the Unicode standard (e.g. encodings, and useful standards such as UAX#29 text segmentation). Multi-lingual issues were not addressed; all examples are in English. Finite-state transducers were also missing, even though these are powerful tools for text processing. Lucene uses FSTs in a number of places: efficient synonym-filtering, character filtering during analysis, fast auto-suggest, tokenizing Japanese text, in-memory postings format. Still, it's fine that some topics are missing: text processing is an immense field and something has to be cut!
The book is unfortunately based on Lucene/Solr 3.x, so features that are new in Lucene/Solr 4.0 are missing, for example the new DirectSpellChecker and scoring models beyond the TF/IDF Vector Space Model. Chapter 4, Fuzzy text searching, mentioned neither Lucene's new FuzzyQuery nor the very fast Levenshtein Automata approach it uses for finding all fuzzy matches from a large set of terms.
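For readers unfamiliar with what "all fuzzy matches" means here, the sketch below shows the problem being solved, not Lucene's solution. It is a brute-force scan in plain Python that computes the Levenshtein edit distance against every term, which is exactly the slow approach an automaton-based method is meant to avoid. The names `levenshtein` and `fuzzy_matches` and the sample terms are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def fuzzy_matches(query, terms, max_edits=2):
    """Return the terms within max_edits of query (naive linear scan)."""
    return [t for t in terms if levenshtein(query, t) <= max_edits]


if __name__ == "__main__":
    terms = ["lucene", "lucine", "solr", "lucent"]
    print(fuzzy_matches("lucene", terms, max_edits=1))
```

The scan costs one distance computation per term, so it becomes expensive on a large term dictionary; that cost is the motivation for the automaton approach mentioned above.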
All in all the book is a great introduction to how to leverage numerous open-source tools to process text.
Got the Manning Early Access Program version months ago, but never delved in very far. This review is a wake-up call to finally read the book! Still no Kindle version in my account, but the website says January 15.
Too bad multi-lingual issues are not covered, since I expect them to be my biggest issues of 2013.