# Automatic knowledge extraction from documents

Fan, J., Kalyanpur, A., Gondek, D. C., & Ferrucci, D. A. (2012). Automatic knowledge extraction from documents. IBM Journal of Research and Development, 56(3.4), 5:1-5:10. doi:10.1147/JRD.2012.2186519

## Introduction

This is another article from the IBM Watson team. It describes PRISMATIC, which uses open relation extraction to obtain knowledge about instances and classes from large text corpora.

## Terminology

1. Frame ($f_i$): the basic semantic unit, consisting of slots (binary relations) paired with their values. A frame represents a set of entities and their relations in a piece of text: $f_i = \{(s_1, v_1), (s_2, v_2), \ldots, (s_n, v_n)\}$. Example: <(verb, receive), (subj, Einstein), (obj, Nobel prize), ...>
2. Slot ($s_i$): a binary relation such as (awarded, $v_i$).
3. Slot value ($v_i$): the lemma form of an extracted term (e.g. "Nobel Prize" for the slot "awarded").
4. Frame projection: a portion of a frame that occurs in many frames and is then used to derive class information from them (e.g. scientists publish, ...).
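
To make the terminology concrete, here is a minimal Python sketch that models a frame as a mapping from slot names to slot values, and a frame projection as the restriction of that mapping to a chosen subset of slots. The names `Frame` and `project` are hypothetical and chosen for illustration only; the notes above do not describe PRISMATIC's actual data structures.

```python
from typing import Dict, Optional, Sequence, Tuple

# Assumption: a frame is modelled as a plain dict {slot name: slot value (lemma)}.
Frame = Dict[str, str]
Projection = Tuple[Tuple[str, str], ...]


def project(frame: Frame, slots: Sequence[str]) -> Optional[Projection]:
    """Return the projection of `frame` onto `slots`, or None if a slot is missing."""
    if any(slot not in frame for slot in slots):
        return None
    # A tuple of (slot, value) pairs is hashable, so projections can be counted later.
    return tuple((slot, frame[slot]) for slot in slots)


# The example frame from the terminology list.
frame: Frame = {"verb": "receive", "subj": "Einstein", "obj": "Nobel prize"}

# Subject-verb projection: keeps only the "subj" and "verb" slots.
sv = project(frame, ["subj", "verb"])
print(sv)
```

Returning a hashable tuple rather than a dict is a design choice of this sketch: it lets identical projections from different frames be grouped and counted directly.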

## Method

PRISMATIC uses a three-step knowledge extraction method:

1. Corpus processing - all sentences are annotated using the English Slot Grammar (ESG) dependency parser, a simple rule-based co-reference resolution component (McCord et al., 2012), and a rule-based named entity recognizer that assigns entity types using the isA slot.
2. Frame extraction - frames are extracted from the dependency parses and their associated annotations. Only relationships that represent participant information of a predicate are extracted, and each frame is restricted to at most two levels deep.
3. Frame projection - provides an aggregated view of selected parts of the frames (e.g. subject-verb-object or verb-objectType) to extract general (class) knowledge. Frame projections provide the following aggregated statistics:
   - Frequency (the number of frames whose slot values match): a popularity measure.
   - Conditional probability: shows how probable a particular instance is (e.g. for subject-verb-object, how often Einstein wins the Nobel prize, given that he wins something).
   - Normalized pointwise mutual information: shows the degree of co-occurrence between frames, $npmi(f, f') = \frac{pmi(f, f')}{-\ln\frac{\max[\#(f), \#(f')]}{N}}$, with values in $[-1, 1]$.
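
The three statistics can be illustrated with a small Python sketch over a toy corpus. Several things here are assumptions rather than details from the notes: frames are plain dicts, $N$ is taken to be the total number of frames, the joint count of $f$ and $f'$ is the number of frames matching both patterns (which are assumed to use disjoint slots), and $pmi$ is the standard $\ln\frac{p(f, f')}{p(f)\,p(f')}$. The function names and the example frames are invented for illustration.

```python
import math
from typing import Dict, List

# Assumption: a frame is a dict {slot name: slot value}; a pattern is a partial frame.
Frame = Dict[str, str]


def frequency(frames: List[Frame], pattern: Frame) -> int:
    """Number of frames whose slot values match every (slot, value) in `pattern`."""
    return sum(
        all(frame.get(slot) == value for slot, value in pattern.items())
        for frame in frames
    )


def conditional_probability(frames: List[Frame], target: Frame, given: Frame) -> float:
    """Estimate P(target | given) as #(given and target) / #(given)."""
    given_count = frequency(frames, given)
    if given_count == 0:
        return 0.0  # undefined in theory; 0.0 is a convention of this sketch
    return frequency(frames, {**given, **target}) / given_count


def npmi(frames: List[Frame], f: Frame, f2: Frame) -> float:
    """NPMI with the max-count normalisation from the formula above.

    Assumes `f` and `f2` constrain disjoint slots and N = len(frames).
    """
    n = len(frames)
    count_f = frequency(frames, f)
    count_f2 = frequency(frames, f2)
    joint = frequency(frames, {**f, **f2})
    if joint == 0:
        return -1.0  # assumption: clamp to the lower end of the stated range
    denominator = -math.log(max(count_f, count_f2) / n)
    if denominator == 0.0:
        return 0.0  # one pattern matches every frame; treated as independence here
    pmi = math.log(joint * n / (count_f * count_f2))
    return pmi / denominator


# Toy corpus, invented for illustration only.
frames: List[Frame] = [
    {"subj": "Einstein", "verb": "win", "obj": "Nobel prize"},
    {"subj": "Einstein", "verb": "win", "obj": "Nobel prize"},
    {"subj": "Einstein", "verb": "win", "obj": "medal"},
    {"subj": "Curie", "verb": "win", "obj": "Nobel prize"},
    {"subj": "Einstein", "verb": "publish", "obj": "paper"},
    {"subj": "Curie", "verb": "publish", "obj": "paper"},
]

svo = {"subj": "Einstein", "verb": "win", "obj": "Nobel prize"}
print(frequency(frames, svo))
print(conditional_probability(frames, {"obj": "Nobel prize"}, {"subj": "Einstein", "verb": "win"}))
print(npmi(frames, {"verb": "publish"}, {"obj": "paper"}))
```

In this toy corpus "publish" and "paper" always occur together, so their score sits at the top of the range, whereas a pair that co-occurs only as often as chance predicts scores near zero.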

## Interesting resources

1. Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet Project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1 (pp. 86-90). Stroudsburg, PA, USA: Association for Computational Linguistics. (a lexical database for the frame structure of selected words)
2. McCord, M. C., Murdock, J. W., & Boguraev, B. K. (2012). Deep parsing in Watson. IBM Journal of Research and Development, 56(3), 264-278. doi:10.1147/JRD.2012.2185409 (co-reference resolution, rule-based named entity recognition and dependency parsing)
3. Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.
4. Schuler, K. K. (2006). VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. University of Pennsylvania. (VerbNet - maps verbs to their corresponding Levin verb classes)

