A Machine-Learning Approach to Negation and Speculation Detection for Sentiment Analysis
Summary
Method
Since negation (speculation) appears in only 18% (23%) of the sentences, the system has to solve a classification problem on imbalanced data, where a standard learner is biased toward the majority class. The authors therefore use a cost-sensitive SVM that assigns a considerably higher cost to misclassifying a minority-class example, which controls the skew of the SVM optimization. A radial basis function (RBF) serves as the kernel. The classifier's feature set has been determined with a greedy forward selection procedure.
- identification of negation and speculation cues: a support vector machine is trained to predict whether a term is located at the beginning of a cue (B), inside a cue (I), or outside of it (O). This BIO model makes it possible to detect multiword cues such as "can't" or "does not". A postprocessing heuristic that sets the tag of the first word of a cue to "B" and ensures that "n't" obtains an "I" tag considerably improves the accuracy of the presented approach. The feature set for this task is limited to lexical information on the term and its neighbors.
- determination of the cue's scope: a second binary SVM classifier determines whether words are inside (IS) or outside (O) the scope of the cue. The classifier uses a much more comprehensive feature set which also includes structural features (i.e. dependency relations) for this task.
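The postprocessing heuristic for the cue tagger can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `repair_bio_tags`, the assumption that the tokenizer splits contractions into a stem plus "n't" (e.g. "do" + "n't"), and the choice to promote the token before "n't" to "B" when it was tagged "O" are all assumptions made for illustration.

```python
def repair_bio_tags(tokens, tags):
    """Repair a predicted BIO tag sequence (hypothetical helper).

    Rule 1: an "I" tag that does not follow a "B" or "I" tag starts a cue,
            so it is rewritten to "B".
    Rule 2: the token "n't" always receives an "I" tag; if the preceding
            token was tagged "O", it is promoted to "B" (assumption) so
            that the cue still starts with "B".
    """
    fixed = list(tags)

    # Rule 1: the first word of every cue must carry a "B" tag.
    for i, tag in enumerate(fixed):
        if tag == "I" and (i == 0 or fixed[i - 1] == "O"):
            fixed[i] = "B"

    # Rule 2: "n't" is never the first word of a cue.
    for i, token in enumerate(tokens):
        if token.lower() == "n't" and i > 0:
            fixed[i] = "I"
            if fixed[i - 1] == "O":
                fixed[i - 1] = "B"

    return fixed


tokens = ["I", "do", "n't", "like", "it"]
predicted = ["O", "O", "B", "O", "O"]
print(repair_bio_tags(tokens, predicted))
```

For the example sentence, the raw prediction marks only "n't" as a cue; after the repair, "do n't" forms one multiword cue tagged B, I.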
Evaluation
- evaluation metrics: precision (P), recall (R), F1, and the geometric mean $$\text{G-Mean} = \sqrt{\text{sensitivity}\cdot\text{specificity}}$$ where sensitivity is the recall on the positive class and $$\text{specificity} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}$$
- definition of correctly classified scopes: (a) the percentage of correct scopes (PCS), where a scope counts as correct only if all of its tokens have been correctly classified as inside or outside of the scope, and (b) the percentage of correct relaxed scopes, i.e. $$\text{PCRS} = \frac{\text{correct spans}}{\text{total spans}}$$
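As a rough illustration of two of these metrics, the sketch below computes G-Mean from confusion-matrix counts and PCS from per-scope tag sequences. The function names, the input layout (one list of IS/O tags per scope), and the example numbers are assumptions for illustration, not values or code from the paper.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall on positives) and specificity."""
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        return 0.0  # undefined without both classes; reported as 0 here
    sensitivity = tp / positives
    specificity = tn / negatives
    return math.sqrt(sensitivity * specificity)


def pcs(gold_scopes, predicted_scopes):
    """Percentage of scopes whose tokens are all tagged correctly."""
    if not gold_scopes:
        return 0.0
    correct = sum(
        1 for gold, pred in zip(gold_scopes, predicted_scopes)
        if list(gold) == list(pred)
    )
    return 100.0 * correct / len(gold_scopes)


# A classifier that always predicts the majority class gets a G-Mean of 0,
# even though 90% of its predictions are correct in this made-up example.
print(g_mean(tp=0, fn=100, tn=900, fp=0))

gold = [["O", "IS", "IS"], ["IS", "O"]]
pred = [["O", "IS", "IS"], ["IS", "IS"]]
print(pcs(gold, pred))  # one of two scopes matches exactly -> 50.0
```

The first call shows why G-Mean suits imbalanced data: it collapses to zero as soon as one class is never recognized, whereas plain accuracy would still look high.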
A case study that incorporates the improved negation and speculation detection into the SO-CAL system (speculative statements are treated as neutral) shows considerably improved results, especially for negative statements.