Text Mining: Predictive Methods for Analyzing Unstructured Information


by Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau.

The book is a very good beginner's introduction to text mining. The first chapter introduces some basic concepts (vector space model, clustering, ...). The second chapter focuses on transforming text into vector space representations. It explains techniques such as tokenization, word stemming, and sparse vectors, along with concepts such as tf-idf and collocations, and presents pseudocode for various tasks. This makes the book a good source for harvesting ideas on how to improve _your_ NLP application ;) Chapters 3 to 6 cover the methodology used in information retrieval and present its most important methods. Each of these chapters concludes with suggestions for testing a method's performance, practical applications, and bibliographical remarks (conferences, important papers, etc.). The third chapter covers the following methods for text classification:

  • similarity and nearest neighbor
  • logic methods (decision rules)
  • probabilistic methods (simple Bayes and its formulation as a multivariate Bernoulli model)
  • weighted-scoring methods
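
To give a feel for the probabilistic approach in the list above, here is a minimal sketch of a multivariate Bernoulli naive Bayes classifier. It is not the book's pseudocode: the function names `train_bernoulli_nb` and `predict_bernoulli_nb`, the dictionary layout of the model, and the add-one (Laplace) smoothing are my own assumptions for illustration.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels):
    """Fit a multivariate Bernoulli model.

    docs   -- list of token lists (hypothetical input format)
    labels -- list of class labels, one per document
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        # Only presence/absence of a word matters, not its count.
        by_class[label].append(set(doc))

    model = {"vocab": vocab, "prior": {}, "cond": {}}
    for label, doc_sets in by_class.items():
        n = len(doc_sets)
        model["prior"][label] = n / len(docs)
        # P(word present | class) with add-one smoothing (an assumption).
        model["cond"][label] = {
            w: (sum(w in d for d in doc_sets) + 1) / (n + 2) for w in vocab
        }
    return model


def predict_bernoulli_nb(model, doc):
    """Return the class with the highest log-posterior for one document."""
    present = set(doc)
    best_label, best_score = None, -math.inf
    for label, prior in model["prior"].items():
        score = math.log(prior)
        # Every vocabulary word contributes, whether present or absent.
        for w in model["vocab"]:
            p = model["cond"][label][w]
            score += math.log(p if w in present else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    train_docs = [
        ["cheap", "pills", "buy"],
        ["buy", "cheap", "now"],
        ["meeting", "agenda", "today"],
        ["project", "meeting", "notes"],
    ]
    train_labels = ["spam", "spam", "ham", "ham"]
    nb = train_bernoulli_nb(train_docs, train_labels)
    print(predict_bernoulli_nb(nb, ["cheap", "buy"]))
    print(predict_bernoulli_nb(nb, ["meeting", "notes"]))
```

Note that the Bernoulli formulation also scores the words that are absent from a document, which is what distinguishes it from a model based on word counts.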

Chapter four is concerned with document search: methods for determining similar documents (cosine similarity, word count and bonus, shared word, etc.), PageRank for result ranking, and the use of inverted word lists are presented.

The fifth chapter gives an introduction to document (and composite) clustering. k-means clustering, centroid classifiers, hierarchical clustering, and the EM algorithm are explained. The final section of the chapter presents several heuristics for assigning labels to the identified clusters.

Finally, the sixth chapter focuses on information extraction (e.g. entity extraction). Sequential tagging, the maximum entropy method (which copes with the major disadvantages of naive Bayes, see chapter 6.2.3), and sequential probability modelling are presented. Afterwards, coreference and relationship extraction and the more general template filling (compare Hearst patterns) are introduced.

Chapter seven presents several case studies for the use of IR methods, while chapter eight focuses on emerging research areas in the IR field (active learning, learning with unlabeled data, multiple sample and voting methods, etc.).
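
To make the document-search ideas from chapter four a bit more concrete, here is a small sketch that combines an inverted word list with tf-idf weighting and cosine similarity. Again, this is my own illustration rather than code from the book: the names `build_index` and `search`, the input format, and the smoothed idf formula `log(N / df) + 1` are assumptions.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index plus length-normalized tf-idf vectors.

    docs -- dict mapping a document id to its list of tokens (assumed format)
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    # Smoothed idf (an assumption; other variants are equally common).
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}

    inverted = defaultdict(set)
    vectors = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        vec = {t: f * idf[t] for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[doc_id] = {t: w / norm for t, w in vec.items()}
        for t in tf:
            inverted[t].add(doc_id)
    return inverted, idf, vectors


def search(query_tokens, inverted, idf, vectors):
    """Rank documents by cosine similarity to the query."""
    tf = Counter(t for t in query_tokens if t in idf)
    if not tf:
        return []
    q = {t: f * idf[t] for t, f in tf.items()}
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    # The inverted index limits scoring to documents sharing a query term.
    candidates = set().union(*(inverted[t] for t in q))
    scores = {
        d: sum(w * vectors[d].get(t, 0.0) for t, w in q.items()) / q_norm
        for d in candidates
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    corpus = {
        "d1": ["text", "mining", "methods"],
        "d2": ["document", "clustering", "methods"],
        "d3": ["cooking", "recipes"],
    }
    inverted, idf, vectors = build_index(corpus)
    for doc_id, score in search(["text", "mining"], inverted, idf, vectors):
        print(doc_id, round(score, 3))
```

The point of the inverted word list is visible in `search`: only documents that share at least one term with the query are ever scored, so the cost of a query does not grow with the full size of the collection.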