A Machine-Learning Approach to Negation and Speculation Detection for Sentiment Analysis

Cruz, Noa P., Maite Taboada, and Ruslan Mitkov. A Machine-Learning Approach to Negation and Speculation Detection for Sentiment Analysis. Journal of the Association for Information Science and Technology, June 1, 2015

Summary

This paper presents a two-stage approach for detecting negation and speculation in text documents: (a) a classifier identifies negation and speculation cues, and (b) an independent second classifier determines the scope of these cues. The authors also include a comprehensive literature review on negation and speculation detection and an evaluation of their approach.

Method

Since negation (speculation) appears in only 18% (23%) of the sentences, the system faces a classification problem on imbalanced data, in which a standard learning algorithm is biased toward the majority class. The authors therefore use a cost-sensitive SVM that assigns a considerably higher cost to misclassifying a minority-class example, which counteracts the skew in the SVM optimization. A Radial Basis Function (RBF) serves as the kernel. The classifier's feature set was determined with a greedy forward selection procedure.
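
To make the cost-sensitive idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the "balanced" weighting rule, the `gamma` value, and the toy 18/82 label split are assumptions chosen for illustration, and the summary above does not state which cost values the paper actually uses.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Assumed heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2); gamma is illustrative."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def weighted_hinge_loss(y_true, scores, weights):
    """Hinge loss where each example's penalty is scaled by its class weight."""
    return sum(weights[y] * max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores))


# Toy data: 18 minority (+1) and 82 majority (-1) examples.
labels = [1] * 18 + [-1] * 82
weights = balanced_class_weights(labels)

# The same margin violation costs more on the minority class.
minority_cost = weighted_hinge_loss([1], [0.0], weights)
majority_cost = weighted_hinge_loss([-1], [0.0], weights)
```

With these weights, an equally sized margin violation on a minority example contributes more to the loss than one on a majority example, which is the mechanism that pushes the decision boundary away from the rare class.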

  1. identification of negation and speculation cues: a support vector machine is trained to predict whether a term is located at the beginning of a cue (B), inside a cue (I) or outside of it (O). This BIO model makes it possible to detect multiword cues such as "can't" or "does not". A postprocessing heuristic that sets the tag of the first word of a cue to "B" and ensures that "n't" obtains an "I" tag considerably improves the accuracy of the approach. The feature set for this task is limited to lexical information on the term and its neighbors.
  2. determination of the cue's scope: a second binary SVM classifier determines whether words are inside (IS) or outside (O) the scope of the cue. The classifier uses a much more comprehensive feature set which also includes structural features (i.e. dependency relations) for this task.
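
The BIO postprocessing heuristic from step 1 could look roughly like the sketch below. The function name `postprocess_bio` and the exact rules are my reading of the description above, not code from the paper; in particular, treating "n't" as a continuation only when it directly follows a cue token is an assumption.

```python
def postprocess_bio(tokens, tags):
    """Repair a predicted BIO tag sequence for cue detection.

    Assumed rules (hypothetical reading of the heuristic):
      - the first token of every cue run is tagged "B"
      - an "n't" token that follows a cue token is tagged "I"
    """
    fixed = []
    prev = "O"
    for token, tag in zip(tokens, tags):
        if tag != "O":
            if prev == "O":
                tag = "B"  # a cue cannot start with an "I" tag
            elif token.lower() == "n't":
                tag = "I"  # the contraction continues the preceding cue
        fixed.append(tag)
        prev = tag
    return fixed


tokens = ["It", "does", "n't", "work"]
predicted = ["O", "I", "B", "O"]  # inconsistent classifier output
repaired = postprocess_bio(tokens, predicted)
```

In this example the inconsistent prediction is rewritten so that "does" opens the cue and "n't" continues it, giving a single two-token cue.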

Evaluation

The evaluation was performed on the Simon Fraser University (SFU) Review Corpus, which comprises 17,263 sentences from reviews of books, cars, computers, cookware, hotels, movies, music and phones.

  • evaluation metrics: precision (P), recall (R), F1, and the geometric mean $$\text{G-Mean} = \sqrt{\text{sensitivity}\cdot\text{specificity}}$$ with $$\text{specificity} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}$$ (sensitivity is the same quantity as recall)
  • scope-level metrics: (a) the percentage of correct scopes (PCS), where a scope counts as correct only if all of its tokens have been correctly classified as inside or outside of the scope, and (b) the percentage of correct relaxed scopes, i.e. $$\text{PCRS} = \frac{\text{correct spans}}{\text{total spans}}$$

The introduced method obtains an F1 score of 88% (92%) for detecting negation (speculation) cues and a PCS of 23% (14%) for detecting scopes. The corresponding PCRS measure amounts to 58% (45%) for negation (speculation).
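
As a rough illustration of two of the metrics, the sketch below computes G-Mean from confusion-matrix counts and PCS from per-scope token labels. The function names, the toy counts, and the list-of-labels representation of a scope are assumptions made for this example; PCRS is left out because the summary does not spell out how a relaxed match is decided.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def pcs(gold_scopes, pred_scopes):
    """Percentage of scopes whose token labels all match the gold labels."""
    correct = sum(1 for gold, pred in zip(gold_scopes, pred_scopes) if gold == pred)
    return 100.0 * correct / len(gold_scopes)


# Toy confusion-matrix counts (illustrative only).
score = g_mean(tp=8, fn=2, tn=90, fp=10)

# Two scopes, each a list of per-token labels; the second has one wrong token.
gold = [["O", "IS", "IS"], ["IS", "IS", "O"]]
pred = [["O", "IS", "IS"], ["IS", "O", "O"]]
exact_match_rate = pcs(gold, pred)
```

Because PCS requires every token of a scope to be right, a single mislabeled token makes the whole scope count as incorrect, which helps explain why PCS is much lower than the token-level scores.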

A case study that incorporates the improved negation and speculation detection into the SO-CAL system (speculative statements are treated as neutral) shows considerably improved results, especially for negative statements.