# Targeted disambiguation of ad-hoc, homogeneous sets of named entities

Wang, C., Chakrabarti, K., Cheng, T., & Chaudhuri, S. (2012). Targeted disambiguation of ad-hoc, homogeneous sets of named entities. In Proceedings of the 21st international conference on World Wide Web (pp. 719–728). New York, USA.

## Introduction

Most prior work on named entity linking targets entities that are present in a knowledge base and therefore (i) requires background knowledge about each entity, and (ii) is biased towards pages that resemble the entity metadata. The paper instead addresses entity sets that are:

1. ad-hoc: no context information about the entities is required (for comparison: according to shoe.com, Wikipedia mentions only 82 out of 900 shoe brands)
2. homogeneous: the documents to link come from a homogeneous domain (e.g. IT coverage, business news, ...)

## Method

MentionRank leverages the following three properties of homogeneous domains:

1. context similarity: the contexts of true mentions are more similar to each other than the contexts of false mentions are (although some false mentions may occur in a similar context as well; e.g. the newspaper Sun, ...)
2. co-mentions: if multiple entity names are mentioned in one document, they are more likely to be true mentions
3. cross-document, cross-entity interdependence: if a mention has a context similar to that of many true mentions, it is likely to be a true mention itself
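The co-mention property can be illustrated with a small counting sketch. The function name `co_mention_counts`, the whole-word string matching and the sample documents are assumptions made for illustration only; the paper's actual candidate detection and prior estimation may differ.

```python
import re


def co_mention_counts(documents, entity_names):
    """Count how many distinct target entity names occur in each document.

    documents: dict mapping a document id to its text (hypothetical input format).
    entity_names: iterable of entity name strings.
    """
    counts = {}
    for doc_id, text in documents.items():
        found = set()
        for name in entity_names:
            # Whole-word match; lookarounds also work for names such as "C++".
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            if re.search(pattern, text):
                found.add(name)
        counts[doc_id] = len(found)
    return counts


documents = {
    "d1": "Java and Python are popular programming languages",
    "d2": "Coffee grown on the island of Java",
}
counts = co_mention_counts(documents, ["Java", "Python", "Ruby"])
```

In this toy input, `d1` names two of the target entities and `d2` only one, so a prior based on co-mentions would favour the "Java" mention in `d1`.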

The authors capture these properties in a graph-based model that is inspired by other graph-based ranking methods such as PageRank:

1. nodes are candidate mentions $$(e_i, d_j)$$, i.e. an occurrence of the name of entity $$e_i$$ in document $$d_j$$, each with
2. an associated estimated prior ranking score $$\mu_{ij}$$, which is determined by the number of co-mentions in the document $$d_j$$, and
3. an associated ranking score $$r_{ij}$$ that also considers the scores of correlated mentions.
4. edges connect the nodes and are weighted by the context similarity $$\nu_{ij, i'j'}$$. The context similarity is the cosine similarity of normalized tf-idf vectors in a vector space model.
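The edge weights can be sketched with a small, dependency-free tf-idf implementation. Whitespace tokenization, the smoothed idf formula and the names `tfidf_vectors` and `cosine_similarity` are assumptions chosen for this example, not details taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(contexts):
    """Return one L2-normalized tf-idf vector (dict: term -> weight) per context."""
    tokenized = [text.lower().split() for text in contexts]
    n_docs = len(tokenized)
    # Document frequency: in how many contexts does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed idf (an assumption); any standard idf variant would do.
        vec = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two already normalized sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


contexts = [
    "java programming language compiler",      # true mention of Java
    "python programming language interpreter",  # true mention of Python
    "java island coffee indonesia",             # false mention of Java
]
vectors = tfidf_vectors(contexts)
sim_true = cosine_similarity(vectors[0], vectors[1])
sim_false = cosine_similarity(vectors[0], vectors[2])
```

Because the vectors are normalized up front, the cosine similarity reduces to a sparse dot product; here the two programming contexts end up more similar to each other than to the geographic one.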

Propagation:

The authors normalize the weights $$\nu$$ and apply the following two restrictions:

1. unlinking - disallow the propagation between candidate mentions of the same entity (to prevent negative effects of context similarities between false mentions), and
2. normalization - limit the total contribution of an individual entity.

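The problem of computing MentionRank can then be rewritten as the fixed-point equation $$r = Mr$$, where $$r$$ is the ranking score vector and $$M$$ is a stochastic, irreducible and aperiodic Markov matrix. These three properties guarantee that a unique solution exists and that iterative methods such as power iteration converge to it.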
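A minimal PageRank-style sketch of solving such a fixed-point equation by power iteration follows. The `damping` parameter, the way the prior is mixed in and the helper names are assumptions for illustration; in particular, the sketch normalizes per node and does not reproduce the paper's per-entity normalization.

```python
def build_transition_matrix(mentions, sim, prior, damping=0.85):
    """Build a column-stochastic matrix M from similarities and priors.

    mentions: list of (entity, document) pairs, one per graph node.
    sim: square list of lists with context similarities.
    prior: positive prior score per node.
    """
    n = len(mentions)
    total_prior = sum(prior)
    p = [x / total_prior for x in prior]
    # Unlinking: no propagation between candidate mentions of the same entity.
    w = [
        [0.0 if mentions[a][0] == mentions[b][0] else sim[a][b] for b in range(n)]
        for a in range(n)
    ]
    M = [[0.0] * n for _ in range(n)]
    for b in range(n):  # column b holds the outgoing weights of node b
        col_sum = sum(w[a][b] for a in range(n))
        for a in range(n):
            # Nodes without outgoing edges fall back to the prior distribution.
            walk = w[a][b] / col_sum if col_sum > 0 else p[a]
            M[a][b] = damping * walk + (1 - damping) * p[a]
    return M


def power_iteration(M, tol=1e-10, max_iter=1000):
    """Iterate r <- M r from a uniform start until the L1 change is below tol."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(M[a][b] * r[b] for b in range(n)) for a in range(n)]
        if sum(abs(x - y) for x, y in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


mentions = [("Java", "d1"), ("Python", "d1"), ("Java", "d2")]
sim = [[1.0, 0.8, 0.3], [0.8, 1.0, 0.1], [0.3, 0.1, 1.0]]
prior = [2.0, 2.0, 1.0]
M = build_transition_matrix(mentions, sim, prior)
r = power_iteration(M)
```

Mixing in a strictly positive prior makes every entry of `M` positive, which is one simple way to obtain the irreducibility and aperiodicity that the convergence argument relies on.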

## Evaluation

The evaluation uses datasets from three domains: programming languages, science fiction books and Sloan fellows, considering only documents that contain candidate mentions. The authors use mean average precision (MAP) as the performance metric and report a MAP of 0.65 for a low number of co-disambiguated entities, and values as high as 0.90 for a high (>= 40) number of co-disambiguated entities.
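For reference, MAP over ranked lists with binary relevance labels can be computed as in the following sketch. Treating each entity's ranked candidate mentions as one list is an assumption about how the scores are aggregated; the function names are illustrative.

```python
def average_precision(ranked_labels):
    """Average precision of one ranked list of booleans (True = true mention)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_true in enumerate(ranked_labels, start=1):
        if is_true:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over several ranked lists."""
    return sum(average_precision(labels) for labels in rankings) / len(rankings)


# Two hypothetical entities, candidate mentions sorted by descending score.
rankings = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = (1/2) / 1
]
map_score = mean_average_precision(rankings)
```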
