Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification


by Melville et al. (KDD 2009)



  • before the rise of Web 2.0, companies published product information and reviews on Web sites that were under their direct sphere of influence.
  • nowadays, the focus of such discussions has shifted away from company-controlled Web sites to the blogosphere and social media, where essentially anyone can comment on products and thereby influence purchase decisions.
  • to cope with such distributed discussions, we need to address the following three challenging data mining tasks:
    • detect relevant discussions of products and relevant higher-level concepts
    • identify the most authoritative and influential contributors
    • determine the sentiment of the discussion


General Remarks

  • Pang et al. have shown that the use of lexicons is not as effective as learning models from training examples.
  • Pooling information is a general approach from the field of Risk Analysis that combines information from multiple "experts", where each expert is usually represented by a probability distribution.
  • typical pooling approaches are linear and logarithmic pooling, which compute a weighted arithmetic mean and a weighted geometric mean of the involved distributions, respectively.
  • Melville et al. adapt the experts' weights with a sigmoid weighting scheme that considers each expert's past error rate: \[\alpha_k = \log \frac{1-err_k}{err_k}\]
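
The two pooling rules and the error-based weights can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the clipping constant and the toy numbers are assumptions made for illustration. It also assumes that every expert has an error rate below 0.5 (so all weights are positive) and assigns a strictly positive probability to every class.

```python
import math

def expert_weight(err, clip=1e-9):
    """alpha_k = log((1 - err_k) / err_k); err is clipped away from 0 and 1."""
    err = min(max(err, clip), 1.0 - clip)
    return math.log((1.0 - err) / err)

def linear_pool(dists, weights):
    """Weighted arithmetic mean of the experts' class distributions."""
    total = sum(weights)
    n_classes = len(dists[0])
    return [sum(w * d[c] for w, d in zip(weights, dists)) / total
            for c in range(n_classes)]

def log_pool(dists, weights):
    """Weighted geometric mean of the experts' class distributions, renormalized."""
    total = sum(weights)
    n_classes = len(dists[0])
    unnorm = [math.exp(sum((w / total) * math.log(d[c])
                           for w, d in zip(weights, dists)))
              for c in range(n_classes)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical experts: a lexicon-based model and a trained classifier,
# each giving a distribution over (positive, negative) for one document.
dists = [[0.6, 0.4], [0.9, 0.1]]
weights = [expert_weight(0.3), expert_weight(0.2)]  # made-up past error rates
print(linear_pool(dists, weights))
print(log_pool(dists, weights))
```

The expert with the lower past error rate receives the larger weight, so both pooled distributions lie closer to its prediction.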

Naive Bayes

  • The authors present an approach for deriving $$P(w_i|+)$$ and $$P(w_i|-)$$ from the entries of a sentiment lexicon, and
  • use Lidstone smoothing with $$\epsilon = 10^{-6}$$, which tends to yield better probabilities than standard Laplace smoothing: \[ P(w_i|c_j) = \frac{t_{ij}+\epsilon}{\sum_{k} t_{kj} + \epsilon|\mathcal{V}|} \] where $$|\mathcal{V}|$$ is the size of the vocabulary of the domain.
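
The smoothing formula can be tried out with a small sketch. It assumes that $$t_{ij}$$ is the number of occurrences of word $$w_i$$ in the training documents of class $$c_j$$; the function name and the toy counts are hypothetical and are not taken from the paper.

```python
def lidstone_probs(counts, vocab, eps=1e-6):
    """Estimate P(w_i | c_j) for one class c_j with Lidstone smoothing.

    counts: dict mapping word -> t_ij (assumed: occurrences of the word in class c_j)
    vocab:  all words of the domain; words missing from counts have t_ij = 0
    """
    vocab = list(vocab)
    denom = sum(counts.get(w, 0) for w in vocab) + eps * len(vocab)
    return {w: (counts.get(w, 0) + eps) / denom for w in vocab}

# Toy counts for the positive class (made up for illustration).
vocab = ["good", "bad", "dull"]
pos_counts = {"good": 3, "bad": 1}

probs = lidstone_probs(pos_counts, vocab)
print(probs)  # "dull" gets a tiny but non-zero probability

# eps = 1 recovers standard Laplace (add-one) smoothing.
print(lidstone_probs(pos_counts, vocab, eps=1.0))
```

With a small $$\epsilon$$ the estimates stay close to the relative frequencies, whereas add-one smoothing shifts noticeably more probability mass to unseen words.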
In the evaluation, the authors show that the combined approach up-weights and down-weights the sentiment values of lexicon entries. For instance, terms such as dark, social, complex, anger, alien and capture are up-weighted for the Movie domain, while talent, reason, promise, save, fair and redeem are down-weighted.