Yao, L., Riedel, S. & McCallum, A., 2010. Collective cross-document relation extraction without labelled data. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. EMNLP '10. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 1013–1023.
Introduction
This article introduces an approach for the extraction of relations from document collections that does not rely on pre-labelled data but uses an existing knowledge base instead.
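A minimal sketch of this idea (commonly called distant supervision), assuming a toy knowledge base and plain substring matching in place of real entity linking. The `Fact` type, the `KNOWLEDGE_BASE` contents, the example sentences, and the `label_sentences` helper are hypothetical and are not taken from the paper:

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str


# Toy knowledge base standing in for Freebase (hypothetical contents).
KNOWLEDGE_BASE = [
    Fact("Barack Obama", "born_in", "Honolulu"),
    Fact("Google", "founded_by", "Larry Page"),
]


def label_sentences(sentences, kb):
    """Pair each sentence with every KB fact whose two entities it mentions."""
    examples = []
    for sentence in sentences:
        for fact in kb:
            # Assumption: substring matching replaces proper entity linking.
            if fact.subject in sentence and fact.obj in sentence:
                examples.append((sentence, fact.subject, fact.obj, fact.relation))
    return examples


sentences = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Barack Obama gave a speech in Honolulu last year.",  # noisy match
    "Larry Page studied at Stanford.",  # only one entity of the fact appears
]
training_examples = label_sentences(sentences, KNOWLEDGE_BASE)
```

The second sentence is labelled `born_in` although it does not express that relation, which illustrates the kind of noise such automatically generated training data contains.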
Method
The presented method
- uses existing knowledge from Freebase to create training examples automatically. Such an approach is known as distant supervision, self-training or weak supervision. Distant supervision often yields noisy or incorrect patterns, which can be partially countered by enforcing constraints (such as entity types).
- obtains and pools data from the whole document collection rather than from single sentences and uses constraints to further improve its accuracy.
- uses an undirected graphical model (a conditional random field, CRF) in which variables correspond to facts and the factors between them measure compatibility (i.e. whether the given data violates any constraints) using the following kinds of factor templates:
  - Bias templates - prefer certain relations over others (e.g. based on their a priori probability).
  - Mention templates - connect mentions (i.e. entities) with relations, considering features such as lexical content and the syntactic path between the mentions.
  - Selectional preference templates - capture the correlations between entity types and relations (i.e. the entity type constraints).
- relies on a Gibbs sampler at inference time, which leads to linear runtime behavior. The model parameters are learned with SampleRank, which performs parameter updates within Markov chain Monte Carlo (MCMC) inference.
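
The three factor templates and the sampling-based inference can be illustrated with a small sketch. It is not the authors' implementation: all weights are hand-set hypothetical values rather than learned ones, the relation and type inventories are made up, and every candidate value is scored with the full model for brevity, whereas an efficient sampler would only rescore the factors touching the resampled variable.

```python
import math
import random

RELATIONS = ["NONE", "born_in", "founded_by"]
TYPES = ["person", "location", "organization"]

# Hypothetical, hand-set factor weights (a real system would learn them).
BIAS = {"NONE": 0.5, "born_in": 0.0, "founded_by": 0.0}
MENTION = {("was born in", "born_in"): 2.0, ("founded", "founded_by"): 2.0}
SEL_PREF = {("born_in", "person", "location"): 1.5,
            ("founded_by", "organization", "person"): 1.5}


def score(state, pairs):
    """Unnormalised log-score: sum of all factor scores for one assignment."""
    total = 0.0
    for (e1, e2), patterns in pairs.items():
        rel = state["rel"][(e1, e2)]
        total += BIAS[rel]                                        # bias template
        total += sum(MENTION.get((p, rel), 0.0) for p in patterns)  # mention template
        total += SEL_PREF.get(                                    # selectional preference
            (rel, state["type"][e1], state["type"][e2]), 0.0)
    return total


def gibbs(pairs, sweeps=200, seed=0):
    """Resample relation and entity-type variables one at a time."""
    rng = random.Random(seed)
    entities = sorted({e for pair in pairs for e in pair})
    state = {"rel": {p: "NONE" for p in pairs},
             "type": {e: rng.choice(TYPES) for e in entities}}
    variables = ([("rel", p, RELATIONS) for p in pairs]
                 + [("type", e, TYPES) for e in entities])
    counts = {p: {r: 0 for r in RELATIONS} for p in pairs}
    for _ in range(sweeps):
        for kind, key, domain in variables:
            log_scores = []
            for value in domain:
                state[kind][key] = value
                log_scores.append(score(state, pairs))
            top = max(log_scores)  # subtract the maximum for numerical stability
            weights = [math.exp(s - top) for s in log_scores]
            state[kind][key] = rng.choices(domain, weights=weights)[0]
        for p in pairs:
            counts[p][state["rel"][p]] += 1
    # Report the most frequently sampled relation for each entity pair.
    return {p: max(c, key=c.get) for p, c in counts.items()}


pairs = {("Barack Obama", "Honolulu"): ["was born in"]}
prediction = gibbs(pairs)
```

Because the selectional preference factor ties the relation variable to both entity-type variables, resampling one of them shifts the conditional distribution of the others. This coupling is what makes the inference joint rather than a set of independent local decisions.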
Evaluation
- in-domain setting (Freebase is partially derived from Wikipedia): discover relations in Wikipedia text.
- out-of-domain setting: extract relations from a New York Times corpus. The evaluation is performed on the top 50 extracted relation instances. Each instance is verified by three annotators, and disagreements are resolved by majority vote. The authors observed precision@50 values between 0.42 and 0.98, depending on the relation type.
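
The evaluation protocol can be sketched as follows, assuming each extracted instance carries one boolean judgement per annotator and that the instances are already sorted by model confidence. The function names and the example judgements are hypothetical and are not data from the paper:

```python
def majority_vote(judgements):
    """True if more than half of the annotators marked the instance correct."""
    return sum(judgements) > len(judgements) / 2


def precision_at_k(ranked_judgements, k=50):
    """Fraction of the top-k ranked instances accepted by majority vote."""
    top = ranked_judgements[:k]
    if not top:
        raise ValueError("no instances to evaluate")
    return sum(majority_vote(j) for j in top) / len(top)


# Made-up judgements from three annotators for four ranked instances.
annotations = [
    (True, True, False),
    (False, False, True),
    (True, True, True),
    (True, False, False),
]
p_at_4 = precision_at_k(annotations, k=4)
```

With three annotators a tie is impossible, so the majority vote always yields a decision.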