Identifying Relations for Open Information Extraction
by Fader et al.
This paper addresses two major shortcomings of state-of-the-art open information extraction systems:
- uninformative extractions that omit critical information, e.g. "Faust made a deal with the devil" -> "Faust" - "made" - "deal"
- incoherent extractions that yield phrases with no meaningful interpretation, e.g. "The guide contains dead links and omits sites" -> "contains omits"; "The Mark 14 was central to the torpedo scandal of the fleet" -> "was central torpedo"
Important Definitions
- information extraction systems - learn an extractor per target relation and therefore do not scale well
- open information extraction - identify relational phrases by
- labeling sentences using heuristics or supervision
- learning phrases based on the training examples (TextRunner needs approximately 200,000 heuristically labeled sentences)
- extracting data based on the learned model
- light verb constructions (LVC) are multi-word expressions that are composed of a verb and a noun, with the noun carrying the semantic content of the predicate
examples:
- is -> is an album by, is the author of, is a city in
- has -> has a population of, has a Ph.D. in, ...
- learning of selectional preferences (Ritter et al., 2010)
- acquiring common sense knowledge (Lin et al., 2010)
- recognizing entailment (Schoenmackers et al., 2010; Berant et al., 2011)
- mapping onto existing ontologies (Soderland et al., 2010)
- require phrases to match POS tag patterns
- multiple possible matches => the longest is chosen
- prevent over-specified relation phrases such as "is offering only ..." by requiring each phrase to appear multiple times in the corpus
- Precision-Recall curve
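
The two constraints noted above (matching a POS tag pattern with the longest match winning, then filtering out rarely seen phrases) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from these notes: the simplified pattern "a verb, optionally followed by any number of nouns/adjectives/adverbs/pronouns/determiners and then a preposition or particle", the Penn Treebank tag mapping in tag_class, the helper names, and the min_pairs threshold. The frequency filter counts distinct argument pairs, which is one possible reading of "appear multiple times in the corpus".

```python
import re
from collections import defaultdict


def tag_class(tag):
    """Map a Penn Treebank tag to a one-letter class (assumed, simplified mapping)."""
    if tag.startswith("VB"):
        return "V"  # verb
    if tag in {"IN", "RP", "TO"}:
        return "P"  # preposition, particle, or "to"
    if tag.startswith(("NN", "JJ", "RB", "PRP")) or tag == "DT":
        return "W"  # noun, adjective, adverb, pronoun, determiner
    return "O"      # anything else ends a relation phrase


# Assumed pattern: a verb alone, or a verb followed by W* and then P.
# The optional group is greedy, so the longer alternative is preferred.
RELATION_PATTERN = re.compile(r"V(?:W*P)?")


def relation_phrases(tagged):
    """Return the longest pattern match starting at each verb.

    `tagged` is a list of (word, pos_tag) pairs for one sentence.
    """
    classes = "".join(tag_class(tag) for _, tag in tagged)
    phrases = []
    for match in RELATION_PATTERN.finditer(classes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


def filter_by_support(extractions, min_pairs=2):
    """Keep relation phrases seen with at least `min_pairs` distinct argument pairs.

    `extractions` is an iterable of (arg1, relation_phrase, arg2) triples.
    The threshold is a hypothetical placeholder, not a value from the notes.
    """
    pairs = defaultdict(set)
    for arg1, rel, arg2 in extractions:
        pairs[rel].add((arg1, arg2))
    return {rel for rel, seen in pairs.items() if len(seen) >= min_pairs}


sentence = [("Faust", "NNP"), ("made", "VBD"), ("a", "DT"), ("deal", "NN"),
            ("with", "IN"), ("the", "DT"), ("devil", "NN")]
print(relation_phrases(sentence))  # ['made a deal with']
```

With this pattern the Faust sentence yields the relation phrase "made a deal with" rather than the uninformative "made", and the noun carrying the meaning in a light verb construction stays inside the phrase. The filter would then drop any phrase that occurs with too few distinct argument pairs in the corpus.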