Targeted disambiguation of ad-hoc, homogeneous sets of named entities

2 minute read

Wang, C., Chakrabarti, K., Cheng, T., & Chaudhuri, S. (2012). Targeted disambiguation of ad-hoc, homogeneous sets of named entities. In Proceedings of the 21st International Conference on World Wide Web (pp. 719-728). New York, USA.

Introduction

Most prior work on named entity linking uses techniques that focus on entities present in a knowledge base and therefore (i) require background knowledge on the entity, and (ii) are biased towards pages that are similar to the entity metadata.

The paper introduces MentionRank, a method that targets ad-hoc entity lists in homogeneous domains.

  1. ad-hoc: the target entity list is given ad hoc, without requiring context information or knowledge base coverage (example: only 82 of the 900 shoe brands listed on shoe.com are covered by Wikipedia)
  2. homogeneous: the documents to be linked come from a homogeneous domain (e.g. IT coverage, business news, ...)

Method

MentionRank leverages the following three properties of homogeneous domains:

  1. context similarity: the contexts of true mentions are more similar to one another than those of false mentions (although some false mentions may also occur in similar contexts; e.g. the newspaper Sun, ...)
  2. co-mentions: if multiple entity names from the target list are mentioned in one document, the mentions are more likely to be true
  3. cross-document, cross-entity interdependence: if a mention's context is similar to that of many true mentions, it is itself likely to be a true mention

The authors capture these properties in a graph-based model inspired by other graph-based ranking methods such as PageRank:

  1. nodes are candidate mentions $$(e_i, d_j)$$, i.e. entity $$e_i$$ mentioned in document $$d_j$$, each with
  2. an associated estimated prior ranking score $$\mu_{ij}$$, which is determined by the number of co-mentions in the document $$d_j$$, and
  3. an associated ranking score $$r_{ij}$$ that also considers the scores of correlated mentions;
  4. edges connect the nodes and use the context similarity $$\nu_{ij, i'j'}$$ as weight. The context similarity is computed with a vector space model using normalized tf-idf vectors and the cosine similarity (a small sketch of this construction follows below).
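To make the graph construction concrete, here is a minimal sketch, not the authors' code: the `candidates` data is made up, the prior $$\mu$$ is assumed to be proportional to the number of target entities co-mentioned in the same document, and scikit-learn's tf-idf vectorizer stands in for whatever context representation the paper uses.

```python
# Sketch of building the MentionRank graph (illustrative assumptions only).
from collections import defaultdict

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical candidate mentions: (entity, document) -> mention context.
candidates = {
    ("java", "doc1"): "the java language compiles to bytecode for the jvm",
    ("python", "doc1"): "python scripts often glue together jvm services",
    ("java", "doc2"): "java is an island of indonesia known for coffee",
}

nodes = list(candidates)  # each node is a candidate mention (e_i, d_j)

# Prior score mu: here simply the co-mention count per document, normalized.
co_mentions = defaultdict(set)
for entity, doc in nodes:
    co_mentions[doc].add(entity)
mu = np.array([len(co_mentions[doc]) for _, doc in nodes], dtype=float)
mu /= mu.sum()

# Edge weights nu: tf-idf cosine similarity between mention contexts.
contexts = [candidates[node] for node in nodes]
tfidf = TfidfVectorizer().fit_transform(contexts)
nu = cosine_similarity(tfidf)   # symmetric similarity matrix
np.fill_diagonal(nu, 0.0)       # no self-edges
```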

Propagation:

The authors normalize the weights $$\nu$$ and apply the following two restrictions:

  1. unlinking - propagation between candidate mentions of the same entity is disallowed (so that context similarities between false mentions do not reinforce each other), and
  2. normalization - the total contribution of an individual entity is limited (see the sketch after this list).
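Continuing the sketch above (assumed variables: `nodes`, `nu`, `np`), the two restrictions could be applied to the weight matrix roughly as follows; the exact normalization in the paper differs in detail, this only illustrates the idea of capping per-entity contributions.

```python
# Apply unlinking and normalization to the edge weights (illustrative only).
entities = [entity for entity, _ in nodes]
W = nu.copy()

# Unlinking: drop edges between candidate mentions of the same entity.
for a, ent_a in enumerate(entities):
    for b, ent_b in enumerate(entities):
        if ent_a == ent_b:
            W[a, b] = 0.0

# Normalization: cap the total weight a single entity contributes to a node
# by averaging over that entity's mentions instead of summing them.
for b in range(len(nodes)):
    for ent in set(entities):
        idx = [a for a, e in enumerate(entities) if e == ent and a != b]
        if idx and W[idx, b].sum() > 0:
            W[idx, b] /= len(idx)

# Finally make the matrix column-stochastic so it can drive the propagation.
col_sums = W.sum(axis=0, keepdims=True)
W = np.divide(W, col_sums, out=np.zeros_like(W), where=col_sums > 0)
```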

The problem of computing MentionRank can then be rewritten as $$r = Mr$$ with the ranking score vector $$r$$ and the stochastic, irreducible and aperiodic Markov matrix $$M$$.
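A fixed point of $$r = Mr$$ can be found with power iteration, as in PageRank. Below is a minimal sketch, assuming $$M$$ blends the prior $$\mu$$ with the constrained weights `W` from the previous block; the damping factor `lam` is an assumption, not a value taken from the paper.

```python
# Power iteration for r = M r (assumed variables: nodes, mu, W, np).
lam = 0.15
n = len(nodes)
M = lam * np.outer(mu, np.ones(n)) + (1.0 - lam) * W

r = np.full(n, 1.0 / n)
for _ in range(100):
    r_next = M @ r
    r_next /= r_next.sum()              # keep the scores normalized
    if np.abs(r_next - r).sum() < 1e-9: # stop once the scores have converged
        break
    r = r_next

# Candidate mentions ranked by their MentionRank score.
ranking = sorted(zip(nodes, r), key=lambda x: -x[1])
```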

Evaluation:

The evaluation uses datasets from three domains: programming languages, science fiction books and Sloan fellows, only considering documents that contain candidate mentions. The authors use mean average precision (MAP) as the performance metric and achieve a MAP of 0.65 for a low number of co-disambiguated entities, with values reaching as high as 0.90 for a high (>=40) number of co-disambiguated entities.
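For reference, mean average precision over ranked candidate lists can be computed as sketched below; the rankings and relevance labels are made up for illustration and are not the paper's data.

```python
# Sketch of mean average precision (MAP) over ranked candidate lists.
def average_precision(relevance):
    """relevance: 0/1 labels in ranked order (1 = true mention)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# One hypothetical ranked list per entity, then averaged into MAP.
rankings = [[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]
map_score = sum(average_precision(r) for r in rankings) / len(rankings)
```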