Automatic knowledge extraction from documents


Fan, J., Kalyanpur, A., Gondek, D. C., & Ferrucci, D. A. (2012). Automatic knowledge extraction from documents. IBM Journal of Research and Development, 56(3.4), 5:1–5:10. doi:10.1147/JRD.2012.2186519


This is another article from the IBM Watson team. It describes PRISMATIC, a knowledge base built with open relation extraction to obtain knowledge about instances and classes from large text corpora.

Key terms:

  1. Frame ($$f_i$$): the basic semantic unit, consisting of slots (binary relations) and their corresponding values. A frame represents a set of entities and their relations in a piece of text: $$f_i = \{(s_1, v_1), (s_2, v_2), \ldots, (s_n, v_n)\}$$ Example: <(verb, receive), (subj, Einstein), (obj, Nobel Prize), ...>
  2. Slot ($$s_i$$): a binary relation such as (awarded, $$v_i$$).
  3. Slot value ($$v_i$$): the lemma form of an extracted term (e.g. "Nobel Prize" for the slot "awarded").
  4. Frame projection: a portion of a frame that occurs in many frames and is used to derive class information from them (e.g. scientists publish, ...).
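
The terms above map onto a small data structure. The following is a minimal Python sketch, not the PRISMATIC implementation: the `Frame` alias, the `project` helper, and the slot names (including `subj_isA` for the entity type of the subject) are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

# A frame is modelled as a mapping from slot names to lemmatised slot values.
# (Hypothetical representation; the paper does not prescribe one.)
Frame = Dict[str, str]
Projection = Tuple[Tuple[str, str], ...]


def project(frame: Frame, slots: Iterable[str]) -> Projection:
    """Restrict `frame` to `slots` and return a hashable tuple of (slot, value).

    Slots that are missing from the frame are skipped.
    """
    return tuple((s, frame[s]) for s in slots if s in frame)


# Toy frame for "Einstein received the Nobel Prize".
einstein: Frame = {
    "verb": "receive",
    "subj": "Einstein",
    "subj_isA": "scientist",  # assumed name for an isA-derived type slot
    "obj": "Nobel Prize",
}

# Instance-level projection (subject-verb-object).
svo = project(einstein, ["subj", "verb", "obj"])

# Class-level projection (subject type and verb), e.g. "scientists receive".
type_verb = project(einstein, ["subj_isA", "verb"])
```

Returning tuples rather than dictionaries keeps projections hashable, so they can be counted directly when frames are aggregated.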


PRISMATIC uses a three-step knowledge extraction method:

  1. Corpus processing - all sentences are annotated using the English Slot Grammar (ESG) dependency parser, a simple rule-based co-reference resolution component (McCord et al., 2012), and a rule-based named entity recognizer that assigns entity types using the isA slot.
  2. Frame extraction - frames are extracted from the dependency parses and associated annotations. Only relations that represent participant information of a predicate are extracted, and each frame is restricted to a depth of at most two levels.
  3. Frame projection - provides an aggregated view of selected parts of the frames (e.g. subject-verb-object or verb-objectType) to extract general (class) knowledge. Frame projections provide the following aggregated statistics:
    • Frequency (the number of frames whose slot values match) - a popularity measure
    • Conditional probability - shows how probable a particular instance is (e.g. for subject-verb-object: how often Einstein wins the Nobel Prize, given that he wins something)
    • Normalized pointwise mutual information - shows the degree of co-occurrence between frames: $$npmi(f, f') = \frac{pmi(f, f')}{-\ln\frac{\max[\#(f), \#(f')]}{N}}$$ with values in [-1, 1]
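
The three statistics can be illustrated on a handful of toy frames. This is a hedged sketch rather than the authors' code: the function names, the dictionary-based frame representation, and the sample frames are assumptions. It also assumes the standard definition $$pmi(f, f') = \ln\frac{N \cdot \#(f, f')}{\#(f)\,\#(f')}$$ with N as the total number of frames (the notes above do not spell either out), and uses the normalization term exactly as quoted above.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Frame = Dict[str, str]
Projection = Tuple[Tuple[str, str], ...]


def project(frame: Frame, slots: Sequence[str]) -> Projection:
    """Restrict a frame to the given slots (hypothetical helper)."""
    return tuple((s, frame[s]) for s in slots if s in frame)


def frequency(frames: List[Frame], slots: Sequence[str]) -> Counter:
    """Count frames per projection; frames lacking any of the slots are ignored."""
    return Counter(
        project(f, slots) for f in frames if all(s in f for s in slots)
    )


def conditional_probability(
    frames: List[Frame], given: Sequence[str], target: Sequence[str]
) -> Dict[Projection, float]:
    """Estimate P(target values | given values) from projection counts."""
    joint = frequency(frames, list(given) + list(target))
    marginal = frequency(frames, given)
    n_given = len(given)
    # The first n_given pairs of each joint key form the conditioning projection.
    return {key: count / marginal[key[:n_given]] for key, count in joint.items()}


def npmi(count_f: int, count_g: int, count_joint: int, n: int) -> float:
    """NPMI with the normalization quoted above: -ln(max(#f, #f') / N)."""
    if min(count_f, count_g, count_joint) <= 0:
        raise ValueError("all counts must be positive")
    norm = -math.log(max(count_f, count_g) / n)
    if norm == 0.0:
        # One projection occurs in every frame, so it carries no information.
        return 0.0
    pmi = math.log(n * count_joint / (count_f * count_g))
    return pmi / norm


# Toy data, for illustration only.
frames: List[Frame] = [
    {"subj": "Einstein", "verb": "win", "obj": "Nobel Prize"},
    {"subj": "Einstein", "verb": "win", "obj": "Nobel Prize"},
    {"subj": "Einstein", "verb": "win", "obj": "Copley Medal"},
    {"subj": "Curie", "verb": "win", "obj": "Nobel Prize"},
]

svo_counts = frequency(frames, ["subj", "verb", "obj"])
probs = conditional_probability(frames, ["subj", "verb"], ["obj"])

# Co-occurrence of the projections subj=Einstein and obj=Nobel Prize.
subj_counts = frequency(frames, ["subj"])
obj_counts = frequency(frames, ["obj"])
pair_counts = frequency(frames, ["subj", "obj"])
score = npmi(
    subj_counts[(("subj", "Einstein"),)],
    obj_counts[(("obj", "Nobel Prize"),)],
    pair_counts[(("subj", "Einstein"), ("obj", "Nobel Prize"))],
    len(frames),
)
```

In this toy corpus Einstein wins the Nobel Prize in two of his three "win" frames, so the conditional probability for that projection is 2/3.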

Interesting resources:

  1. Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet Project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1 (pp. 86–90). Stroudsburg, PA, USA: Association for Computational Linguistics. (a lexical database for the frame structure of selected words)
  2. McCord, M. C., Murdock, J. W., & Boguraev, B. K. (2012). Deep parsing in Watson. IBM J. Res. Dev., 56(3), 264–278. doi:10.1147/JRD.2012.2185409 (co-reference resolution, rule-based named entity recognition and dependency parsing)
  3. Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.
  4. Schuler, K. K. (2006). VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. University of Pennsylvania. (VerbNet - maps verbs to their corresponding Levin verb classes)