
Reading list: Foundations of Statistical Natural Language Processing

Recently I was looking into making my NLP knowledge more solid, and a reference led me to this book: Foundations of Statistical Natural Language Processing. It's a classic, and it was certainly a good read.

Now, the topics it discusses might sound quite theoretical, so let me translate them into a few examples of how each could be applied in your work.

First, you might find it interesting to read the introduction (chapters 1-4) to get acquainted with the terminology and relevant maths. Next:

Chapter 5: Collocations. This chapter focuses mainly on the identification and verification of collocations. What caught my attention is the explanation of several significance tests. The authors use them to show that a given collocation really is a collocation, whereas in general significance tests are used to show that a certain dependency is indeed significant. Used properly, they can become a powerful tool in your hands.
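To make this concrete, here is a minimal sketch of one such test: Pearson's chi-square on a 2x2 bigram contingency table. The counts below are in the spirit of the book's "new companies" example; treat them as illustrative numbers, not an exact reproduction.

```python
def chi_square_bigram(o11, o12, o21, o22):
    """Pearson's chi-square statistic for a 2x2 bigram contingency table.

    o11: count of bigrams (w1, w2)
    o12: count of bigrams (w1, not-w2)
    o21: count of bigrams (not-w1, w2)
    o22: count of bigrams (not-w1, not-w2)
    """
    n = o11 + o12 + o21 + o22
    # Closed-form chi-square for a 2x2 table (no expected-counts loop needed).
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Illustrative counts for a candidate bigram like "new companies":
stat = chi_square_bigram(8, 4667, 15820, 14287181)
# Compare against the critical value for 1 degree of freedom at
# alpha = 0.05, which is 3.841. Here stat is about 1.55, so we cannot
# reject independence: the bigram is not a significant collocation.
print(stat > 3.841)  # prints False
```

The same function works for any 2x2 dependency question, which is exactly the "powerful tool" point: the test is not specific to collocations at all.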

Chapter 9: Markov Models. Generally, everything connected with fuzzy string matching and sequence identification is a job for Markov Models. My favourite example is date extraction: recognize names and numbers, collect examples of date formats, train a Markov Model, and you get a date extractor that pulls dates out of plain text. Chapter 10 discusses in detail a classic challenge for Markov Models: how they can be used for Part-of-Speech tagging.
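The date-extraction idea can be sketched as a tiny hidden Markov model that tags each token as DATE or OTHER and decodes with Viterbi. All probabilities and the token-shape emission rule below are invented for illustration; a real system would estimate them from a labeled corpus of date formats.

```python
import math

STATES = ("DATE", "OTHER")
START = {"DATE": 0.1, "OTHER": 0.9}
TRANS = {
    "DATE": {"DATE": 0.7, "OTHER": 0.3},   # dates tend to continue
    "OTHER": {"DATE": 0.1, "OTHER": 0.9},
}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

def emit(state, token):
    """Hypothetical emission probability based on simple token shape."""
    is_datey = token.isdigit() or token.lower() in MONTHS
    if state == "DATE":
        return 0.8 if is_datey else 0.2
    return 0.1 if is_datey else 0.9

def viterbi(tokens):
    """Most likely state sequence (log-space to avoid underflow)."""
    v = [{s: math.log(START[s]) + math.log(emit(s, tokens[0])) for s in STATES}]
    back = []
    for tok in tokens[1:]:
        scores, ptrs = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[-1][p] + math.log(TRANS[p][s]))
            scores[s] = v[-1][prev] + math.log(TRANS[prev][s]) + math.log(emit(s, tok))
            ptrs[s] = prev
        v.append(scores)
        back.append(ptrs)
    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return list(reversed(path))

tokens = "meeting on 14 March 2005 in Berlin".split()
print(viterbi(tokens))  # tags 14, March, 2005 as DATE, the rest as OTHER
```

Note how the sticky DATE-to-DATE transition is what groups "14 March 2005" into one span instead of tagging tokens independently; that is the whole point of using a sequence model here.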

Chapter 14: Clustering. This is one of my favourite topics, and it's all because clustering is an unsupervised data mining method. Again, there are a lot of applications. Any automatic fetching and processing of web content sooner or later runs into clustering, because web sites very often contain almost identical pages or elements with subtle differences. Here clustering comes into play and does the job for you. Of course it involves a lot of tuning and tweaking, but in the end clustering still stays at the core of the solution.
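Here is a minimal sketch of that near-duplicate-pages scenario: each page is reduced to a set of word trigrams ("shingles"), and a greedy single-link pass puts pages with high Jaccard similarity into the same cluster. The page texts and the 0.5 threshold are invented for illustration.

```python
def shingles(text, k=3):
    """Set of overlapping word k-grams representing a page."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster(pages, threshold=0.5):
    """Greedy single-link clustering: a page joins the first cluster
    containing a sufficiently similar page, otherwise starts its own."""
    clusters = []
    for name, text in pages:
        sig = shingles(text)
        for c in clusters:
            if any(jaccard(sig, other) >= threshold for _, other in c):
                c.append((name, sig))
                break
        else:
            clusters.append([(name, sig)])
    return [[name for name, _ in c] for c in clusters]

pages = [
    ("a.html", "blue widget for sale in our store today only"),
    ("b.html", "red widget for sale in our store today only"),
    ("c.html", "contact us by email or phone for support"),
]
print(cluster(pages))  # a.html and b.html end up in one cluster
```

The tuning-and-tweaking part I mentioned lives in exactly two places here: the shingle size k and the similarity threshold.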

Chapter 16: Text Categorization. Here you'll learn a lot about weights and profiles, decision trees, and k-Nearest-Neighbor. But mainly, categorization is about assigning a category to a text along with a probability. The simplest example: categorizing a web site. You don't want to do that manually if you process hundreds of them per day.
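A k-Nearest-Neighbor categorizer from that chapter can be sketched in a few lines: represent each text as a bag-of-words vector, compare with cosine similarity, and vote among the k closest training texts. The training data below is invented, and a real system would use weighted (e.g. tf-idf) vectors rather than raw counts.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_categorize(text, training, k=3):
    """Majority vote among the k training texts most similar to `text`."""
    vec = Counter(text.lower().split())
    scored = sorted(
        ((cosine(vec, Counter(doc.lower().split())), label)
         for doc, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

training = [
    ("cheap flights and hotel deals book now", "travel"),
    ("flight tickets to rome and paris", "travel"),
    ("hotel booking discounts this weekend", "travel"),
    ("latest football scores and match results", "sports"),
    ("league table goals and transfer news", "sports"),
]
print(knn_categorize("book cheap flights to paris", training))  # travel
```

The vote counts over the k neighbours also give you the probability mentioned above: with k=3 and all three neighbours labeled "travel", the estimate is 3/3.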

Remember: this book can be too theoretical at times, but if you want to dive into the field of natural language processing, it's a good starting point.