Targeted disambiguation of ad-hoc, homogeneous sets of named entities

2 minute read

Wang, C., Chakrabarti, K., Cheng, T., & Chaudhuri, S. (2012). Targeted disambiguation of ad-hoc, homogeneous sets of named entities. In Proceedings of the 21st International Conference on World Wide Web (pp. 719-728). New York, USA.

Introduction

Most prior work on named entity linking uses techniques that focus on entities present in a knowledge base and therefore (i) require background knowledge on the entity, and (ii) are biased towards pages that are similar to the entity metadata.

The paper introduces MentionRank, a method that targets ad-hoc entity lists in homogeneous domains.

  1. ad-hoc: the target entity list is given ad hoc, without requiring context information or knowledge base coverage (example: only 82 of the 900 shoe brands listed on shoe.com are covered by Wikipedia)
  2. homogeneous: the documents to be linked come from a homogeneous domain (e.g. IT coverage, business news, ...)

Method

MentionRank leverages the following three properties of homogeneous domains:

  1. context similarity: the contexts of true mentions are more similar to one another than those of false mentions (although some false mentions may also occur in similar contexts; e.g. the newspaper Sun, ...)
  2. co-mentions: if multiple entity names from the target list are mentioned in one document, the mentions are more likely to be true
  3. cross-document, cross-entity interdependence: if a mention's context is similar to that of many true mentions, it is itself likely to be a true mention

The authors capture these properties in a graph-based model inspired by other graph-based ranking methods such as PageRank:

  1. nodes are candidate mentions $$(e_i, d_j)$$, i.e. entity $$e_i$$ mentioned in document $$d_j$$, each with
  2. an associated estimated prior ranking score $$\mu_{ij}$$, which is determined by the number of co-mentions in the document $$d_j$$, and
  3. an associated ranking score $$r_{ij}$$ that also considers the scores of correlated mentions;
  4. edges connect the nodes and use the context similarity $$\nu_{ij, i'j'}$$ as weight. The context similarity is computed with a vector space model using normalized tf-idf vectors and the cosine similarity (a small sketch of this construction follows below).
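To make the graph construction concrete, here is a minimal sketch, not the authors' code: the `candidates` data is made up, the prior $$\mu$$ is assumed to be proportional to the number of target entities co-mentioned in the same document, and scikit-learn's tf-idf vectorizer stands in for whatever context representation the paper uses.

```python
# Sketch of building the MentionRank graph (illustrative assumptions only).
from collections import defaultdict

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical candidate mentions: (entity, document) -> mention context.
candidates = {
    ("java", "doc1"): "the java language compiles to bytecode for the jvm",
    ("python", "doc1"): "python scripts often glue together jvm services",
    ("java", "doc2"): "java is an island of indonesia known for coffee",
}

nodes = list(candidates)  # each node is a candidate mention (e_i, d_j)

# Prior score mu: here simply the co-mention count per document, normalized.
co_mentions = defaultdict(set)
for entity, doc in nodes:
    co_mentions[doc].add(entity)
mu = np.array([len(co_mentions[doc]) for _, doc in nodes], dtype=float)
mu /= mu.sum()

# Edge weights nu: tf-idf cosine similarity between mention contexts.
contexts = [candidates[node] for node in nodes]
tfidf = TfidfVectorizer().fit_transform(contexts)
nu = cosine_similarity(tfidf)   # symmetric similarity matrix
np.fill_diagonal(nu, 0.0)       # no self-edges
```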

Propagation:

The authors normalize the weights $$\nu$$ and apply the following two restrictions:

  1. unlinking - propagation between candidate mentions of the same entity is disallowed (so that context similarities between false mentions do not reinforce each other), and
  2. normalization - the total contribution of an individual entity is limited (see the sketch after this list).
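Continuing the sketch above (assumed variables: `nodes`, `nu`, `np`), the two restrictions could be applied to the weight matrix roughly as follows; the exact normalization in the paper differs in detail, this only illustrates the idea of capping per-entity contributions.

```python
# Apply unlinking and normalization to the edge weights (illustrative only).
entities = [entity for entity, _ in nodes]
W = nu.copy()

# Unlinking: drop edges between candidate mentions of the same entity.
for a, ent_a in enumerate(entities):
    for b, ent_b in enumerate(entities):
        if ent_a == ent_b:
            W[a, b] = 0.0

# Normalization: cap the total weight a single entity contributes to a node
# by averaging over that entity's mentions instead of summing them.
for b in range(len(nodes)):
    for ent in set(entities):
        idx = [a for a, e in enumerate(entities) if e == ent and a != b]
        if idx and W[idx, b].sum() > 0:
            W[idx, b] /= len(idx)

# Finally make the matrix column-stochastic so it can drive the propagation.
col_sums = W.sum(axis=0, keepdims=True)
W = np.divide(W, col_sums, out=np.zeros_like(W), where=col_sums > 0)
```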

The problem of computing MentionRank can then be rewritten as $$r = Mr$$ with the ranking score vector $$r$$ and the stochastic, irreducible and aperiodic Markov matrix $$M$$.
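A fixed point of $$r = Mr$$ can be found with power iteration, as in PageRank. Below is a minimal sketch, assuming $$M$$ blends the prior $$\mu$$ with the constrained weights `W` from the previous block; the damping factor `lam` is an assumption, not a value taken from the paper.

```python
# Power iteration for r = M r (assumed variables: nodes, mu, W, np).
lam = 0.15
n = len(nodes)
M = lam * np.outer(mu, np.ones(n)) + (1.0 - lam) * W

r = np.full(n, 1.0 / n)
for _ in range(100):
    r_next = M @ r
    r_next /= r_next.sum()              # keep the scores normalized
    if np.abs(r_next - r).sum() < 1e-9: # stop once the scores have converged
        break
    r = r_next

# Candidate mentions ranked by their MentionRank score.
ranking = sorted(zip(nodes, r), key=lambda x: -x[1])
```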

Evaluation:

The evaluation uses datasets from three domains: programming languages, science fiction books and Sloan fellows, only considering documents that contain candidate mentions. The authors use mean average precision (MAP) as the performance metric and achieve a MAP of 0.65 for a low number of co-disambiguated entities, with values reaching as high as 0.90 for a high (>=40) number of co-disambiguated entities.
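For reference, mean average precision over ranked candidate lists can be computed as sketched below; the rankings and relevance labels are made up for illustration and are not the paper's data.

```python
# Sketch of mean average precision (MAP) over ranked candidate lists.
def average_precision(relevance):
    """relevance: 0/1 labels in ranked order (1 = true mention)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# One hypothetical ranked list per entity, then averaged into MAP.
rankings = [[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]
map_score = sum(average_precision(r) for r in rankings) / len(rankings)
```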