IdentityRank: Named entity disambiguation in the news domain

2 minute read

Fernández, N. et al., 2012. IdentityRank: Named entity disambiguation in the news domain. Expert Systems with Applications, 39(10), pp. 9207-9221.

Introduction

This article introduces a supervised algorithm for disambiguating named entities in the news domain. The algorithm processes streams of news items and takes advantage of metadata such as time stamps as context information.

Design Principles

  • Semantic coherence: related instances typically occur together. A mention of a company increases the probability that its CEO is mentioned in the same article. Entities also tend to occur with higher probability in certain topics (e.g. politics, sports, etc.).
  • Temporal coherence: relevant events (e.g. Olympics, summits, etc.) are likely to generate several news items describing them (and therefore sharing the same context).

Method

Definition:

  • PageRank: "a page has a high page rank if the sum of the ranks of the pages that link to it is high".
  • IdRank: "an entity has a high rank if the sum of the ranks in the news items of the entities that typically co-occur with it is high".

The IdRank algorithm processes the following information sources for disambiguation:

  1. co-occurrence of entities in the same message (semantic coherence)
  2. historical information on the probability of entities occurring in certain topics (semantic coherence)
  3. temporal information (temporal coherence)

$$C_m$$ represents the set of candidate instances for a certain name $$m \in M$$, where $$M$$ is the set of names detected in the news item. $$C$$ represents the set of candidate instances for the entire document:

\[ C = \bigcup_{\forall m\in M} C_m\]
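
To make the notation concrete, here is a small illustrative sketch (my own toy example, not taken from the paper) that builds the per-name candidate sets $$C_m$$ from a hypothetical mention-to-candidate mapping and unions them into $$C$$:

```python
# Illustrative sketch: per-name candidate sets C_m and their union C.
# The mention-to-candidate mapping below is a hypothetical toy knowledge base.

candidates_by_name = {
    "Bush": {"George_W_Bush", "George_H_W_Bush", "Kate_Bush"},
    "Washington": {"Washington_DC", "Washington_State", "Denzel_Washington"},
}

def candidate_sets(mentions):
    """Return C_m for each mention m and the overall candidate set C."""
    c_m = {m: candidates_by_name.get(m, set()) for m in mentions}
    c = set().union(*c_m.values()) if c_m else set()
    return c_m, c

c_m, c = candidate_sets(["Bush", "Washington"])
print(sorted(c))
```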

The probability that instance $$I_i$$ occurs given that instance $$I_j$$ occurs (i.e. that the two entities co-occur) is defined as

\[P(I_i|I_j) = \frac{ndocs(I_i, I_j)}{ndocs(I_j)} \]

$$ndocs(I_i, I_j)$$ refers to the number of previously processed news items that have been annotated with both instances ($$I_i, I_j$$), and $$ndocs(I_j)$$ to the number of news items annotated with $$I_j$$. Combining the IdRank idea with this co-occurrence probability yields the following relevance indicator $$R(I_i)$$:

\[R(I_i) = \sum_{j=1}^N a_{ij} R(I_j)\]

with

\[ a_{ij} = \begin{cases}P(I_i|I_j) & \text{if } i \neq j \text{ and } ndocs(I_j) \neq 0\\ 0 & \text{otherwise}\end{cases}\]

The corresponding matrix representation is

\[R = AR\]
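
As a minimal sketch of how these definitions fit together (my own reading of the formulas, not the authors' code), the following snippet builds $$A$$ from assumed toy co-occurrence counts and approximates the fixed point of $$R = AR$$ by power iteration:

```python
import numpy as np

instances = ["I0", "I1", "I2"]                      # candidate instances in C
ndocs = np.array([40.0, 25.0, 10.0])                # ndocs(I_j): items annotated with I_j
ndocs_pair = np.array([[40.0, 12.0,  3.0],          # ndocs(I_i, I_j): items annotated
                       [12.0, 25.0,  5.0],          # with both I_i and I_j
                       [ 3.0,  5.0, 10.0]])

# a_ij = P(I_i | I_j) for i != j, 0 on the diagonal or when ndocs(I_j) = 0
n = len(instances)
A = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j and ndocs[j] > 0:
            A[i, j] = ndocs_pair[i, j] / ndocs[j]

# Power iteration: repeatedly apply A and renormalise until R stabilises.
R = np.full(n, 1.0 / n)
for _ in range(100):
    R_new = A @ R
    R_new /= R_new.sum()
    if np.allclose(R, R_new, atol=1e-9):
        break
    R = R_new

print(dict(zip(instances, R.round(3))))
```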

Extensions

The authors extend the equation above with terms that incorporate categorical (topic) information and temporal information, yielding

\[ R=\alpha E_n + (1-\alpha)[k_{\alpha}AR + k_{kat}E_{kat} + k_{tim}E_{tim}] \]
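
The snippet below sketches one iteration of this extended update. Only the overall shape of the equation is taken from the formula above; the construction of the evidence vectors $$E_n$$, $$E_{kat}$$ and $$E_{tim}$$ and the values of $$\alpha$$ and the $$k$$ weights are placeholder assumptions, not the paper's actual settings:

```python
import numpy as np

def idrank_step(R, A, E_n, E_kat, E_tim, alpha=0.15, k_a=0.6, k_kat=0.2, k_tim=0.2):
    """One iteration of R = alpha*E_n + (1-alpha)*[k_a*A*R + k_kat*E_kat + k_tim*E_tim]."""
    R_new = alpha * E_n + (1 - alpha) * (k_a * (A @ R) + k_kat * E_kat + k_tim * E_tim)
    return R_new / R_new.sum()   # renormalise so scores stay comparable (not part of the formula)

# Toy usage with the 3-instance co-occurrence matrix from the previous sketch.
n = 3
A = np.array([[0.0,   0.48, 0.3],
              [0.3,   0.0,  0.5],
              [0.075, 0.2,  0.0]])
E_n = np.full(n, 1.0 / n)              # uniform evidence
E_kat = np.array([0.5, 0.3, 0.2])      # assumed topic-based priors
E_tim = np.array([0.2, 0.6, 0.2])      # assumed recency-based evidence

R = np.full(n, 1.0 / n)
for _ in range(50):
    R = idrank_step(R, A, E_n, E_kat, E_tim)
print(R.round(3))
```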

Evaluation

Corpora:

  • WePS (http://nlp.uned.es/weps) from the Knowledge Base Population track of the Text Analysis Conference (TAC).
  • NYT (New York Times) corpus, available for purchase from the Linguistic Data Consortium (LDC)

Metrics

  • they use accuracy, defined as the proportion of decisions that are correct with respect to the overall number of decisions
  • a correct decision is one where the correct entity has been chosen for an ambiguous name (neglecting potentially irrelevant entities that have also been detected); see the sketch below
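
For illustration, a small sketch of this accuracy metric over hypothetical (chosen, gold) decision pairs:

```python
# Accuracy as described above: the fraction of disambiguation decisions whose
# chosen entity matches the gold entity. `decisions` is hypothetical example data.

def accuracy(decisions):
    correct = sum(1 for chosen, gold in decisions if chosen == gold)
    return correct / len(decisions) if decisions else 0.0

decisions = [("George_W_Bush", "George_W_Bush"),
             ("Washington_DC", "Washington_State"),
             ("Kate_Bush", "Kate_Bush")]
print(accuracy(decisions))  # 0.666...
```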