Relation Extraction and the Influence of Automatic Named-Entity Recognition
Introduction
Relation extraction aims at identifying directed binary relations $$R_{ij} = (E_i, E_j)$$ between entities $$E_i$$ and $$E_j$$ in text documents, where in general $$R_{ij} \neq R_{ji} = (E_j, E_i)$$. This article introduces an approach that uses kernel functions to integrate information from (i) the sentence in which the relation appears and (ii) the local context around the interacting entities.
Method
The authors treat relation extraction as a classification task that distinguishes the following classes:
- correct: locatedIn(Chur, Switzerland) -> 2
- correct, but incorrect direction: locatedIn(Switzerland, Chur) -> 1
- incorrect: locatedIn(Chur, St. Gallen) -> 0
- wrong entity types: locatedIn(Christian Toth, Chur) -> -1
The classifier combines information from the global and the local context (both described below) by summing the corresponding kernels:
\[ K_{SL}(R_1,R_2) = K_{GC}(R_1,R_2) + K_{LC}(R_1,R_2). \]
The method was implemented using the LibSVM package.
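As a rough illustration of this setup, the following sketch trains a multi-class SVM on a precomputed combined kernel via scikit-learn's SVC (which wraps LibSVM); the Gram matrices, labels, and variable names are placeholders, not the authors' implementation:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder Gram matrices over four relation candidates:
# K_global[i, j] stands in for K_GC(R_i, R_j), K_local[i, j] for K_LC(R_i, R_j).
rng = np.random.default_rng(0)
A, B = rng.random((4, 5)), rng.random((4, 5))
K_global = A @ A.T
K_local = B @ B.T

# Combined kernel: K_SL(R1, R2) = K_GC(R1, R2) + K_LC(R1, R2)
K_sl = K_global + K_local

# One label per candidate, using the classes defined above: 2, 1, 0, -1
y = np.array([2, 1, 0, -1])

# SVC with a precomputed kernel uses LibSVM under the hood.
clf = SVC(kernel="precomputed")
clf.fit(K_sl, y)

# Prediction requires the kernel values between new candidates and the
# training candidates; here the training Gram matrix is simply reused.
print(clf.predict(K_sl))
```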
Global Context Kernel
Bunescu and Mooney (2005) observe that relations between entities are usually expressed in one of the following contexts:
- Fore-Between (FB): e.g. "the head of [org], Dr [per]"
- Between (B): e.g. "[org] spokesman [per]"
- Between-After (BA): e.g. "[per], a [org] law professor"
The global context kernel is then the sum of the kernels computed on these three patterns:
\[ K_{GC}(R_1, R_2) = K_{FB}(R_1, R_2) + K_{B}(R_1, R_2) + K_{BA}(R_1, R_2) \]
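A minimal sketch of such a sum over the three context patterns, assuming a simple bag-of-tokens kernel per pattern (the context names and data layout are made up for illustration):

```python
from collections import Counter

def bag_of_tokens_kernel(tokens1, tokens2):
    # Dot product between sparse bag-of-tokens vectors (an illustrative choice).
    c1, c2 = Counter(tokens1), Counter(tokens2)
    return sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())

def global_context_kernel(r1, r2):
    # K_GC = K_FB + K_B + K_BA, each computed on the corresponding context.
    return sum(bag_of_tokens_kernel(r1[p], r2[p])
               for p in ("fore_between", "between", "between_after"))

# Hypothetical relation candidates, each with its three token contexts
r1 = {"fore_between": ["the", "head", "of"], "between": [","], "between_after": []}
r2 = {"fore_between": ["the", "head", "of"], "between": ["spokesman"], "between_after": []}
print(global_context_kernel(r1, r2))  # 3 tokens shared in the fore-between context
```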
Local Context Kernel
The local context often provides clues for (i) the presence of a relation and (ii) its direction. The authors represent each local context using the following basic features, taking the ordering of tokens into account:
- Token
- Lemma of the token
- POS-Tag of the token
- Stem of the token
- Orthographic class, a function that maps tokens into equivalence classes capturing properties such as capitalization, punctuation, and numerals.
The local context kernel combines the kernels computed on the left and right local contexts:
\[ K_{LC}(R_1, R_2) = K_{left}(R_1, R_2) + K_{right}(R_1, R_2) \]
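The sketch below shows one way to encode the per-token features listed above and to compare the left and right windows position by position; the feature encoding and window construction are assumptions made for illustration:

```python
def token_features(token, lemma, pos, stem, ortho):
    # Basic features of a single token in the local context (see the list above).
    return {("token", token), ("lemma", lemma), ("pos", pos),
            ("stem", stem), ("ortho", ortho)}

def window_kernel(window1, window2):
    # Count features shared by tokens at corresponding positions,
    # so the ordering of tokens is taken into account.
    return sum(len(f1 & f2) for f1, f2 in zip(window1, window2))

def local_context_kernel(r1, r2):
    # K_LC = K_left + K_right over the windows around the candidate entities.
    return (window_kernel(r1["left"], r2["left"])
            + window_kernel(r1["right"], r2["right"]))

# Hypothetical candidates: each window is a list of per-token feature sets
r1 = {"left": [token_features("Dr", "Dr", "NNP", "dr", "Capitalized")],
      "right": [token_features("said", "say", "VBD", "said", "lowercase")]}
r2 = {"left": [token_features("Mr", "Mr", "NNP", "mr", "Capitalized")],
      "right": [token_features("said", "say", "VBD", "said", "lowercase")]}
print(local_context_kernel(r1, r2))  # counts shared features per position
```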
Evaluation
The authors performed a 5-fold cross-validation on the dataset used by Roth and Yih (2004), which is based on the TREC corpus and covers the relation types locatedIn, workFor, orgBasedIn, liveIn, kill, and noRel, yielding
- F1 values between 71 and 82% with gold-standard named entities (i.e., all named entities are known), and
- F1 values between 69 and 81% with automatically recognized named entities.

The evaluation also discusses the impact of spurious named entities introduced by incorrect NER and of missing named entities.
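A hedged sketch of such a 5-fold cross-validation with a precomputed kernel and macro-averaged F1 scoring; the data below is random placeholder material, not the Roth and Yih corpus:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.svm import SVC

# Placeholder combined kernel K_sl and class labels y (classes -1, 0, 1, 2)
rng = np.random.default_rng(0)
X = rng.random((100, 20))
K_sl = X @ X.T
y = rng.integers(-1, 3, size=100)

scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in cv.split(K_sl, y):
    clf = SVC(kernel="precomputed")
    clf.fit(K_sl[np.ix_(train, train)], y[train])    # train-vs-train Gram matrix
    pred = clf.predict(K_sl[np.ix_(test, train)])    # test-vs-train Gram matrix
    scores.append(f1_score(y[test], pred, average="macro"))
print(np.mean(scores))
```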
Bibliography
- Bunescu, R.C. & Mooney, R.J., 2005. Subsequence Kernels for Relation Extraction. In 19th Conference on Neural Information Processing Systems (NIPS 2005). Vancouver, British Columbia, Canada.
- Roth, D. & Yih, W., 2004. A Linear Programming Formulation for Global Inference in Natural Language Tasks. In H. T. Ng & E. Riloff, eds. 8th Conference on Computational Natural Language Learning (CoNLL 2004). Association for Computational Linguistics, pp. 1-8.