# IdentityRank: Named entity disambiguation in the news domain

Fernández, N. et al., 2012. IdentityRank: Named entity disambiguation in the news domain. Expert Systems with Applications, 39(10), pp.9207-9221.

# Introduction

This article introduces a supervised algorithm for disambiguating named entities in the news domain. The algorithm processes streams of news items, exploiting metadata such as time stamps as context information.

# Design Principles

• Semantic coherence: related instances typically occur together. A mention of a company, for instance, increases the probability that its CEO occurs in the same article. Entities are also more likely to occur in certain topics (e.g. politics, sports, etc.).
• Temporal coherence: relevant events (e.g. Olympics, summits, etc.) are likely to generate several news items describing them (and therefore sharing the same context).

# Method

Definition:

• Page rank: "a page has a high page rank if the sum of the ranks of the pages that link to it is high".
• IdRank: "an entity has a high rank if the sum of the ranks in the news items of the entities that typically co-occur with it is high".

The IdRank algorithm draws on the following information sources for disambiguation:

1. co-occurrence of entities in the same message (semantic coherence)
2. historical information on the probability of entities to occur in certain topics (semantic coherence)
3. temporal information (temporal coherence)

$$C_m$$ represents the set of candidate instances for a certain name $$m \in M$$. $$C$$ represents the set of candidate instances for the entire document:

$C = \bigcup_{\forall m\in M} C_m$
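
The union above can be sketched directly in Python; the names and candidate instances below are hypothetical example data, not taken from the paper.

```python
# Candidate instance sets per ambiguous name m in one news item
# (hypothetical example data).
candidates = {
    "Bush": {"George_W_Bush", "George_H_W_Bush"},
    "Washington": {"Washington_DC", "George_Washington"},
}

# C: the union of the candidate sets of all names in the document.
C = set().union(*candidates.values())
```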

The probability that two instances (=entities $$I_i, I_j$$) co-occur together is defined as

$P(I_i|I_j) = \frac{ndocs(I_i, I_j)}{ndocs(I_j)}$

$$ndocs(I_i, I_j)$$ denotes the number of previously processed news items that have been annotated with both instances $$I_i$$ and $$I_j$$. Combining the IdRank definition with this co-occurrence probability yields the following relevance indicator $$R(I_i)$$:

$R(I_i) = \sum_{j=1}^N a_{ij} R(I_j)$

with

$a_{ij} = \begin{cases}P(I_i|I_j) & \text{if } i \neq j \text{ and } ndocs(I_j) \neq 0\\ 0 & \text{otherwise}\end{cases}$

The corresponding matrix representation is

$R = AR$
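
The fixed point $$R = AR$$ can be approximated by power iteration, as in PageRank. The sketch below uses hypothetical co-occurrence counts (the paper's corpora are not reproduced here) and renormalises the rank vector at each step.

```python
import numpy as np

# Hypothetical co-occurrence statistics over previously processed news items:
# ndocs[j] = number of items annotated with instance I_j;
# ndocs_pair[i, j] = number of items annotated with both I_i and I_j.
ndocs = np.array([10.0, 5.0, 8.0])
ndocs_pair = np.array([
    [10.0, 3.0, 4.0],
    [3.0,  5.0, 2.0],
    [4.0,  2.0, 8.0],
])

# a_ij = P(I_i | I_j) = ndocs(I_i, I_j) / ndocs(I_j) for i != j, else 0.
A = ndocs_pair / ndocs[np.newaxis, :]
np.fill_diagonal(A, 0.0)

# Power iteration on R = A R, renormalising each step so the
# ranks stay on the unit simplex.
n = len(ndocs)
R = np.full(n, 1.0 / n)
for _ in range(100):
    R_next = A @ R
    R_next /= R_next.sum()
    if np.allclose(R, R_next, atol=1e-10):
        break
    R = R_next
```

Renormalising each iteration makes this equivalent to computing the dominant eigenvector of $$A$$, which is the standard way such PageRank-style fixed points are solved.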

## Extensions

The authors extend the equation above with terms that incorporate categorical and temporal information, yielding

$R=\alpha E_n + (1-\alpha)[k_{\alpha}AR + k_{cat}E_{cat} + k_{tim}E_{tim}]$
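
A sketch of the extended update as a damped fixed-point iteration. The evidence vectors and weights below are hypothetical stand-ins: a uniform prior for $$E_n$$, a made-up categorical prior and temporal score, and weights that sum to one; the paper's actual definitions are not reproduced here.

```python
import numpy as np

n = 3
# Hypothetical damping factor and mixing weights (k_a + k_cat + k_tim = 1).
alpha, k_a, k_cat, k_tim = 0.15, 0.6, 0.2, 0.2

# Hypothetical evidence vectors: E_n a uniform prior, E_cat the historical
# probability of each candidate in the item's topic, E_tim a temporal score.
E_n = np.full(n, 1.0 / n)
E_cat = np.array([0.5, 0.3, 0.2])
E_tim = np.array([0.2, 0.2, 0.6])

# Co-occurrence matrix A with a_ij = P(I_i | I_j) (hypothetical values).
A = np.array([
    [0.0, 0.6, 0.5],
    [0.3, 0.0, 0.25],
    [0.4, 0.4, 0.0],
])

# Iterate R = alpha*E_n + (1-alpha)*[k_a*A@R + k_cat*E_cat + k_tim*E_tim].
# This converges because the linear part is scaled by (1-alpha)*k_a < 1.
R = np.full(n, 1.0 / n)
for _ in range(100):
    R = alpha * E_n + (1 - alpha) * (k_a * (A @ R) + k_cat * E_cat + k_tim * E_tim)
```

Because the recursive term is damped, plain iteration suffices here; no per-step renormalisation is needed.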

# Evaluation

## Corpora

• WePS (http://nlp.uned.es/weps) from the Knowledge Base Population track of the Text Analysis Conference (TAC).
• NYT (New York Times) corpus, available for purchase from the Linguistic Data Consortium (LDC).

## Metrics

• the authors use accuracy, defined as the proportion of correct decisions with respect to the overall number of decisions.
• a decision counts as correct if the correct entity has been chosen for an ambiguous name (ignoring potentially irrelevant entities that were also detected).
