Ensemble Semantics for Large-scale Unsupervised Relation Extraction

Min, B. et al., 2012. Ensemble semantics for large-scale unsupervised relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. EMNLP-CoNLL 2012. Stroudsburg, PA, USA: Association for Computational Linguistics, pp. 1027–1037.

This article introduces an unsupervised algorithm for open relation extraction that explicitly handles both the polysemy and the synonymy of relational phrases.

Method

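The authors map relation instances <ent_1, ctx, ent_2> to the corresponding relations $$P(C_1, C_2)$$, where ctx is the relational phrase connecting the two entities, P is the relation type, and C_1 and C_2 are the classes of its arguments.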

  • Polysemy occurs if one relational phrase (ctx) is used to express multiple relation types P.
    • "Graubünden is a part of Switzerland" - isLocatedIn(state, country) versus "easybank is a part of the BAWAG PSK Group" - subUnit(company, company)
    • "The Euro is the currency of Germany" - isCurrency(Euro, Germany) versus "authorship is the currency of science" - successFactor(authorship, science)
  • Synonymy occurs if different relational phrases (ctx) refer to the same relation type P:
    • The Euro is the currency used in Germany. The Australian Dollar is legal tender in Western Australia.
The authors address polysemy by clustering the arguments of relational phrases, drawing on the following semantic resources:
  1. Entity similarity graphs
    • Distributional similarity draws upon the distributional hypothesis (Harris, 1985), which states that similar terms occur in similar contexts. The authors use a text window (size=4) to retrieve context terms and PMI (pointwise mutual information) to weight the context features. The Jaccard similarity of these features then determines the similarity of two concepts.
    • Pattern similarity identifies semantically similar lexical patterns (e.g. "(such as|including)| T,{,T}* (and|, |.)") or HTML tag patterns.
  2. Hypernym graphs identify common hypernyms (e.g. Lainach, Flims --> village)
  3. Relation phrase similarity generates a pairwise similarity graph for relation phrases that indicates the probability of two phrases expressing the same relation. The authors apply a variant of the DIRT algorithm (Lin and Pantel, 2001) that uses stemmed lexical sequences (relation phrases) instead of dependency paths and ordered pairs of argument features.
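The distributional similarity from item 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a symmetric window of four tokens on each side, keeps only positive PMI weights, and uses the weighted (min/max) generalisation of the Jaccard coefficient. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_counts(sentences, window=4):
    """Count how often each context token occurs within `window` tokens of a term."""
    pair_counts = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[term][tokens[j]] += 1
    return pair_counts


def pmi_vectors(pair_counts):
    """Turn raw co-occurrence counts into PMI-weighted context vectors."""
    total = sum(sum(c.values()) for c in pair_counts.values())
    term_totals = {t: sum(c.values()) for t, c in pair_counts.items()}
    ctx_totals = Counter()
    for c in pair_counts.values():
        ctx_totals.update(c)
    vectors = {}
    for term, ctxs in pair_counts.items():
        vec = {}
        for ctx, n in ctxs.items():
            pmi = math.log((n * total) / (term_totals[term] * ctx_totals[ctx]))
            if pmi > 0:  # assumption: drop non-positive associations
                vec[ctx] = pmi
        vectors[term] = vec
    return vectors


def weighted_jaccard(u, v):
    """Weighted Jaccard: sum of minima over sum of maxima of the feature weights."""
    keys = set(u) | set(v)
    num = sum(min(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    den = sum(max(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0
```

On a toy corpus, two currency names that share the context "is currency of" should come out more similar to each other than to a village name; real use would of course require a large corpus and entity-level tokenisation.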
The relation list obtained from this clustering (free of polysemy but still containing synonyms) is then clustered again to merge similar relations (i.e. synonyms), yielding relations that are neither polysemous nor synonymous.
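The two-stage idea can be illustrated with a deliberately simplified sketch. It is not the clustering algorithm from the paper: stage one replaces argument clustering with an exact lookup in a hypothetical hypernym table, and stage two replaces the second clustering with a greedy threshold merge over a hypothetical phrase-similarity table. All data and names below are toy assumptions.

```python
from collections import defaultdict

# Hypothetical toy resources standing in for the hypernym and phrase-similarity graphs.
HYPERNYMS = {
    "Graubünden": "state", "Switzerland": "country",
    "easybank": "company", "BAWAG PSK Group": "company",
    "Euro": "currency", "Germany": "country",
    "Australian Dollar": "currency", "Australia": "country",
}
PHRASE_SIM = {frozenset(("is the currency of", "is legal tender in")): 0.8}

INSTANCES = [
    ("Graubünden", "is a part of", "Switzerland"),
    ("easybank", "is a part of", "BAWAG PSK Group"),
    ("Euro", "is the currency of", "Germany"),
    ("Australian Dollar", "is legal tender in", "Australia"),
]


def split_polysemy(instances, hypernyms):
    """Stage 1: separate the senses of a phrase by the classes of its arguments."""
    groups = defaultdict(list)
    for e1, phrase, e2 in instances:
        groups[(phrase, hypernyms[e1], hypernyms[e2])].append((e1, e2))
    return dict(groups)


def merge_synonyms(groups, phrase_sim, threshold=0.5):
    """Stage 2: greedily merge groups with equal argument classes and similar phrases."""
    relations = []
    for (phrase, c1, c2), pairs in groups.items():
        for rel in relations:
            if rel["types"] == (c1, c2) and any(
                phrase_sim.get(frozenset((phrase, p)), 0.0) >= threshold
                for p in rel["phrases"]
            ):
                rel["phrases"].add(phrase)
                rel["pairs"].extend(pairs)
                break
        else:
            # No compatible relation found: start a new one.
            relations.append({"types": (c1, c2), "phrases": {phrase}, "pairs": list(pairs)})
    return relations
```

Calling `merge_synonyms(split_polysemy(INSTANCES, HYPERNYMS), PHRASE_SIM)` first splits "is a part of" into a (state, country) sense and a (company, company) sense, and then folds the two currency phrases into a single relation.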

Bibliography

  1. Lin, D. & Pantel, P., 2001. DIRT - discovery of inference rules from text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD 2001). New York, NY, USA: ACM, pp. 323–328.
  2. The ClueWeb09 Dataset (Evaluation Corpus)