The viability of web-derived polarity lexicons

by Velikovich et al. (Google Research)

This paper describes an approach for semi-automatically generating a sentiment lexicon from seed terms and a Web corpus. The authors build a graph of phrases and use a graph propagation algorithm to determine how positive and negative polarity spreads from the seed terms through this graph. The graph has the following components:

  • node set V: contains the candidate phrases (n-grams of up to length ten)
  • set of edges E: connects two candidate phrases based on the cosine similarity of their context vectors; a phrase's context vector is built by aggregating the terms that appear within a six-word window around all mentions of the phrase in the Web corpus
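
The graph construction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles single-token candidates only (the paper uses n-grams), reads the "six-word window" as six tokens on either side of a mention, and the function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter, defaultdict
import math


def context_vectors(sentences, candidates, window=6):
    """Count, for each candidate token, the tokens occurring within
    `window` positions on either side of every mention (assumed reading
    of the six-word window)."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok not in candidates:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[tok][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_edges(vectors, threshold=0.0):
    """Return {(a, b): similarity} for every candidate pair whose
    similarity exceeds the (hypothetical) threshold."""
    phrases = sorted(vectors)
    edges = {}
    for idx, a in enumerate(phrases):
        for b in phrases[idx + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim > threshold:
                edges[(a, b)] = sim
    return edges
```

Candidates that occur in similar contexts (for example "great" and "awesome" after "the movie was") end up connected by a high-weight edge, which is what lets polarity flow from seed terms to unseen phrases.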

The authors discuss why their propagation method is expected to outperform label propagation, which works well with high-quality data but not with the noisy and untrustworthy graphs constructed from the Web.

Evaluation Measure

Finally, the authors present an evaluation which makes use of the following purity measure:

\[ \mathrm{purity}(X) = \frac{\sum_{x \in X} \mathrm{pol}_x}{\delta + \sum_{x \in X} |\mathrm{pol}_x|} \]
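
The purity measure can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name, the list-of-scores input, and the default `delta=1.0` are assumptions chosen only to make the behaviour visible.

```python
def purity(polarities, delta=1.0):
    """Signed polarity sum divided by (delta + sum of absolute polarities).

    `polarities` holds the lexicon scores of the sentiment phrases found
    in one sentence; `delta` must be positive (the value 1.0 is an
    illustrative assumption).
    """
    total = sum(polarities)
    magnitude = sum(abs(p) for p in polarities)
    return total / (delta + magnitude)


one_positive = purity([1.0])          # 1 / (1 + 1)
two_positive = purity([1.0, 1.0])     # 2 / (1 + 2)
mixed = purity([1.0, -1.0])           # 0 / (1 + 2)
```

With these toy scores, two positive phrases score higher than one, and a sentence with one positive and one negative phrase scores zero.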

This purity measure normalizes the polarity score of a sentence to the range [-1, 1] and yields values of larger magnitude for sentences containing only positive or only negative terms. Due to the parameter \(\delta\), it also assigns scores of larger magnitude to sentences which contain multiple sentiment phrases.

Conclusion

The evaluations show that the proposed method outperforms traditional lexicons because the generated dictionaries contain a wider range of phrases such as

  • spelling variations
  • slang
  • vulgarity, and
  • multi-word expressions
which have not been available to previous systems.