Open Information Extraction using Wikipedia

3 minute read

Wu, F. & Weld, D.S., 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10). Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 118–127.

This article discusses an open information extraction system that generates its training data from heuristic matches between a Wikipedia page (subject), its infobox attribute values (objects), and the sentences of the article that mention both.

1. Introduction

Most open IE systems use self-supervised learning, i.e. heuristics for generating training examples. TextRunner, for example, uses a small set of handwritten rules to label examples from sentences in the Penn Treebank.

These systems then train an extractor on the labeled examples and use it to extract one (subject, relation, object) triple for every relation stated explicitly in the text.

2. Method

2.1. Preprocessing

  • The text is split into sentences and afterwards preprocessed using either (i) the Stanford dependency parser (WOE_parser) or (ii) part-of-speech tags only (WOE_pos).
  • The preprocessor builds synonym sets (e.g. University of Washington = UW, ...; a sketch follows this list) by drawing upon
    • Wikipedia redirection pages (e.g. USA -> United States) and
    • backlinks (i.e. the anchor text of links pointing to an article)
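
A minimal Python sketch of how such synonym sets could be assembled; the input dicts stand in for data parsed from a Wikipedia dump and are assumptions, not the paper's data structures.

```python
from collections import defaultdict

def build_synonym_sets(redirects, backlinks):
    """Collect alternative names for each article title.

    redirects: maps a redirect title to its target article,
               e.g. {"USA": "United States"}
    backlinks: maps an article title to the anchor texts of links
               pointing at it, e.g. {"University of Washington": {"UW"}}
    """
    synonyms = defaultdict(set)
    for alias, target in redirects.items():
        synonyms[target].add(alias)
    for target, anchors in backlinks.items():
        synonyms[target].update(anchors)
    for title in synonyms:
        synonyms[title].add(title)  # every title is a synonym of itself
    return synonyms

print(build_synonym_sets({"USA": "United States"},
                         {"University of Washington": {"UW"}}))
```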

2.2. Matcher

  • The matcher constructs training data by searching Wikipedia pages for sentences that mention both the subject (= the current Wikipedia page) and the object (= a Wikipedia infobox attribute value). Since parsing Wikipedia infoboxes is quite complex, the authors draw upon a cleaned set of infoboxes from more than one million articles provided by DBpedia.
  • Rules for identifying subjects, ordered by descending weight; if multiple rules match, the match with the highest weight is selected (see the sketch after this list).
    1. Full match
    2. Synonym set match
    3. Partial match (prefix or suffix of the entity's name); if the full name contains punctuation, only a prefix is allowed (for example, "Amherst" matches "Amherst, Mass" but "Mass" does not).
    4. Patterns of the type "the {entity type}" (e.g. "the city", "the university", ...). The matcher extracts the type with extraction patterns, since most Wikipedia articles start with sentences such as "The city of ..." or "The University of ...".
    5. The most frequent pronoun. The matcher assumes that the most frequent pronoun (e.g. "he", "she", or "it") refers to the page's subject.

  • Rules for matching sentences
    1. Attribute values matched by multiple sentences are skipped.
    2. The subject and object are required to be the heads of the noun phrases containing them.
    3. The subject and object need to appear in the same clause (or in parent/child clauses) of the parse tree.
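
As a rough illustration of the matcher, the subject-identification rules above can be expressed as a weighted cascade that returns the highest-weighted rule that fires. All helper predicates and weights here are simplified assumptions, not the paper's actual matcher.

```python
import re

def full_match(sentence, subject):
    return subject in sentence

def synonym_match(sentence, synonyms):
    return any(s in sentence for s in synonyms)

def partial_match(sentence, subject):
    # If the full name contains punctuation, only a prefix may match,
    # e.g. "Amherst" matches "Amherst, Mass" but "Mass" does not.
    if re.search(r"[,.;:]", subject):
        candidates = [subject.split(",")[0].strip()]
    else:
        tokens = subject.split()
        candidates = [" ".join(tokens[:i]) for i in range(1, len(tokens))]   # prefixes
        candidates += [" ".join(tokens[i:]) for i in range(1, len(tokens))]  # suffixes
    return any(c in sentence for c in candidates)

def type_pattern_match(sentence, entity_type):
    # Matches phrases like "the city" or "the university".
    return entity_type is not None and f"the {entity_type}" in sentence.lower()

def pronoun_match(sentence, pronoun):
    return pronoun in sentence.lower().split()

# (weight, rule) pairs, ordered by descending weight.
RULES = [
    (5, lambda s, ctx: full_match(s, ctx["subject"])),
    (4, lambda s, ctx: synonym_match(s, ctx["synonyms"])),
    (3, lambda s, ctx: partial_match(s, ctx["subject"])),
    (2, lambda s, ctx: type_pattern_match(s, ctx.get("entity_type"))),
    (1, lambda s, ctx: pronoun_match(s, ctx.get("pronoun", "it"))),
]

def best_subject_match(sentence, ctx):
    """Weight of the highest-ranked rule that fires, or None."""
    for weight, rule in RULES:
        if rule(sentence, ctx):
            return weight
    return None

ctx = {"subject": "University of Washington",
       "synonyms": {"UW", "University of Washington"}}
print(best_subject_match("UW is a public university in Seattle.", ctx))  # 4
```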

2.3. Learning Extractors

  • WOE_parser uses the Stanford parser with its collapsed-dependency format. Based on the parser's output, the authors create a directed graph in which the shortest connecting path between subject and object (the core path) represents the relation. To capture a phrase's full meaning, they create an expanded path that also contains adverbial and adjectival modifiers as well as dependencies such as "neg" (negation) and "auxpass" (passive); a sketch of the core-path computation follows this list.
  • Relevant patterns are identified based on their normalized logarithmic frequency (a higher frequency yields a more relevant pattern): \[ rel(p) = \frac{\max(\log f_p - \log f_{min},\, 0)}{\log f_{max} - \log f_{min}} \] where $$f_p$$ is the pattern's frequency, $$f_{max}$$ the frequency of the most frequently used pattern in the database, and $$f_{min}$$ a user-selected minimum frequency a pattern must reach to be considered.
  • WOE_pos relies on POS patterns only and therefore runs approximately 30 times faster than WOE_parser. It uses the same learning algorithm as TextRunner: positive examples are generated by the matcher, and negative examples are drawn from random noun-phrase pairs in other sentences.
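
A minimal sketch of the two steps above, under assumed inputs: the dependency parse is modeled as a list of (head, relation, dependent) triples turned into a networkx graph, and pattern frequencies are plain counts. The function names and data layout are illustrative, not the paper's implementation.

```python
import math
import networkx as nx

def core_path(dependencies, subject, obj):
    """Shortest connecting path between subject and object in the
    dependency graph (the "core path").

    dependencies: iterable of (head, relation, dependent) triples, e.g.
    the output of a collapsed-dependency parse. Tokens are assumed to be
    unique strings; a real implementation would use token indices.
    """
    g = nx.Graph()
    for head, rel, dep in dependencies:
        g.add_edge(head, dep, rel=rel)
    return nx.shortest_path(g, source=subject, target=obj)

def relevance(f_p, f_min, f_max):
    """Normalized logarithmic frequency rel(p) of a pattern."""
    numerator = max(math.log(f_p) - math.log(f_min), 0.0)
    return numerator / (math.log(f_max) - math.log(f_min))

# Collapsed dependencies for "Dan was not born in Berkeley" (assumed parse):
deps = [("born", "nsubjpass", "Dan"),
        ("born", "auxpass", "was"),
        ("born", "neg", "not"),
        ("born", "prep_in", "Berkeley")]
print(core_path(deps, "Dan", "Berkeley"))          # ['Dan', 'born', 'Berkeley']
print(relevance(f_p=120, f_min=10, f_max=10_000))  # ~0.36
```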

3. Evaluation

The evaluation uses 300 randomly selected sentences from each of the following corpora: (i) the WSJ section of the Penn Treebank, (ii) Wikipedia, and (iii) the general Web. Each sentence was examined by two people to label all reasonable triples. Afterwards, these triples were mixed with pseudo-negative ones and submitted to Amazon Mechanical Turk for verification.

4. Conclusion

  • WOE yields an 18-34% improvement over TextRunner when using POS tags and a 72-91% improvement when using dependency-parse features.
  • Although Jiang and Zhai (2007) conclude that parser-based features are unnecessary for IE (given the availability of lexical features), the abstract dependency paths produced by the Stanford parser seem to be highly informative for methods in which no lexical features are used.