Generating High-Coverage Semantic Orientation Lexicons from Overtly Marked Words and a Thesaurus

by Mohammad et al.

1. General

  • confirms the Pollyanna Hypothesis, which states that people prefer to use positive words and expressions. This suggests that we should be able to find more positive than negative sentiment terms.
  • Turney and Littman obtained the sentiment of a word by determining whether it co-occurs more often with a set of positive words than with a set of negative words.
2. Method

The method consists of two steps:

  1. identify a seed set of positive and negative words using affix patterns (dis-, im-, in-, mal-, mis-, non-, un-, ill-, ir-, -less, -ful; e.g. honest / dishonest, happy / unhappy, ...)
  2. use a thesaurus to mark the words synonymous with the positive set "positive" and the words synonymous with the negative set "negative"
2.1. Seed words

Seed words are taken from the General Inquirer or derived from affix patterns as described below:

  1. marked words such as dishonest, unhappy, impure, ... -> negative
  2. unmarked words such as happy, honest, ... -> positive (there are some exceptions but they are clearly in the minority)
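
The affix-based seed step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `affix_seeds`, the toy vocabulary, and the restriction to a handful of prefixes plus the -less suffix are all assumptions made for the example.

```python
# Hypothetical helper; the prefixes are a subset of the affix patterns listed above.
NEGATING_PREFIXES = ("dis", "im", "in", "mal", "mis", "non", "un")


def affix_seeds(vocabulary):
    """Return (positive, negative) seed sets built from overtly marked word pairs."""
    vocab = set(vocabulary)
    positive, negative = set(), set()
    for word in vocab:
        # Candidate marked forms: prefix + word, and word + "less".
        candidates = [prefix + word for prefix in NEGATING_PREFIXES]
        candidates.append(word + "less")
        for marked in candidates:
            if marked in vocab:
                positive.add(word)    # unmarked member of the pair
                negative.add(marked)  # overtly marked member of the pair
    return positive, negative


# Toy vocabulary, assumed for illustration only.
toy_vocabulary = ["honest", "dishonest", "happy", "unhappy", "pure", "impure", "table"]
positive, negative = affix_seeds(toy_vocabulary)
print(sorted(positive))  # ['happy', 'honest', 'pure']
print(sorted(negative))  # ['dishonest', 'impure', 'unhappy']
```

A word such as "table" has no marked counterpart in the vocabulary and therefore stays out of both seed sets. The exceptions mentioned above (marked words that are not negative) are not handled in this sketch.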
2.2. Generalization

Mark the thesaurus's paragraphs and terms positive or negative, depending on which group of seed words (positive or negative) occurs more often in them.
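
One possible reading of this generalization step is a simple majority vote, sketched below. The thesaurus layout (a dict from paragraph id to word list), the function names, and the choice to leave ties unlabeled are assumptions for illustration; the paper may resolve these details differently.

```python
from collections import Counter


def label_paragraphs(thesaurus, positive_seeds, negative_seeds):
    """Label each thesaurus paragraph by the seed group that occurs more often in it."""
    labels = {}
    for paragraph_id, words in thesaurus.items():
        pos = sum(word in positive_seeds for word in words)
        neg = sum(word in negative_seeds for word in words)
        if pos > neg:
            labels[paragraph_id] = "positive"
        elif neg > pos:
            labels[paragraph_id] = "negative"
        # Ties (including paragraphs without any seed word) stay unlabeled.
    return labels


def label_words(thesaurus, paragraph_labels):
    """Give each word the label carried by the majority of its labeled paragraphs."""
    votes = {}
    for paragraph_id, words in thesaurus.items():
        label = paragraph_labels.get(paragraph_id)
        if label is None:
            continue
        for word in words:
            votes.setdefault(word, Counter())[label] += 1
    lexicon = {}
    for word, counts in votes.items():
        if counts["positive"] > counts["negative"]:
            lexicon[word] = "positive"
        elif counts["negative"] > counts["positive"]:
            lexicon[word] = "negative"
    return lexicon


# Toy thesaurus, assumed for illustration only.
toy_thesaurus = {
    "p1": ["honest", "truthful", "sincere"],
    "p2": ["dishonest", "deceitful", "unhappy"],
    "p3": ["table", "desk"],
}
paragraph_labels = label_paragraphs(
    toy_thesaurus, {"honest", "happy"}, {"dishonest", "unhappy"}
)
lexicon = label_words(toy_thesaurus, paragraph_labels)
print(paragraph_labels)
print(lexicon)
```

In this toy run, "truthful" and "sincere" inherit a positive label from the seed "honest", while the paragraph without seed words contributes nothing to the lexicon.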

3. Evaluation

The method is evaluated by

  1. computing the overlap with a manually generated lexicon (General Inquirer)
  2. comparing the performance of the lexicon in evaluating the MPQA corpus (http://www.cs.pitt.edu/mpqa)
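
The overlap with a reference lexicon could be measured along the following lines. This is only a sketch under assumptions: the two measures (coverage and label agreement), the function name `lexicon_overlap`, and the toy entries are illustrative and are not taken from the paper or from the General Inquirer.

```python
def lexicon_overlap(generated, reference):
    """Return (coverage, agreement) of a generated lexicon against a reference one.

    coverage:  share of reference words that also appear in the generated lexicon
    agreement: share of shared words that carry the same label in both lexicons
    """
    shared = set(generated) & set(reference)
    agreeing = sum(1 for word in shared if generated[word] == reference[word])
    coverage = len(shared) / len(reference) if reference else 0.0
    agreement = agreeing / len(shared) if shared else 0.0
    return coverage, agreement


# Toy lexicons, assumed for illustration only.
generated = {"honest": "positive", "deceitful": "negative", "sincere": "negative"}
reference = {
    "honest": "positive",
    "deceitful": "negative",
    "sincere": "positive",
    "awful": "negative",
}
coverage, agreement = lexicon_overlap(generated, reference)
print(f"coverage={coverage:.2f} agreement={agreement:.2f}")
```

Reporting the two numbers separately keeps a lexicon that is large but noisy apart from one that is small but accurate.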
4. Visualization

  1. based on a force-directed graph layout which shows the original sentiment lexicon and its extensions.
  2. the authors use the NodeXL (http://www.codeplex.com/NodeXL) tool to create visualizations which
    • contain a smaller subset of the lexicon
    • and have been cleaned up using graph-theoretic metrics.