WISE 2011 - Training a Named Entity Recognizer on the Web


by Urbansky et al.

The authors distinguish between three approaches towards NER:

hand-crafted resources (lexicons, rules),
supervised machine learning, and
unsupervised machine learning (e.g., clustering)

Creating Training Data

Collins and Singer show that at least seven seed terms are required to create well-performing extraction rules. A higher number of seeds increases recall at the cost of the component's precision. Urbansky et al. apply the following process for creating the training data:

create a number of seed entities
query a search engine to retrieve documents containing these seeds
remove boilerplate from the retrieved documents
annotate the seeds and remove all lines that either
1. do not contain any annotated seed
2. are shorter than 80 characters, or
3. do not contain any context around the seed
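The three filter criteria above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name keep_line, the whole-word seed match, and the reading of "context" as "any word characters besides the seed" are assumptions; only the 80-character threshold comes from the text.

```python
import re

MIN_LINE_LENGTH = 80  # threshold named in the post


def keep_line(line: str, seeds: list[str]) -> bool:
    """Return True if a line passes all three filter criteria (hypothetical helper)."""
    text = line.strip()
    # Criterion 2: drop lines shorter than 80 characters.
    if len(text) < MIN_LINE_LENGTH:
        return False
    for seed in seeds:
        match = re.search(r"\b" + re.escape(seed) + r"\b", text)
        if match is None:
            continue
        # Criterion 3 (assumed reading): something other than the seed
        # itself must remain on the line to serve as context.
        remainder = text[:match.start()] + text[match.end():]
        if re.search(r"\w", remainder):
            return True
    # Criterion 1: no seed with usable context was found.
    return False
```

With a length threshold this high, the context check rarely fires on its own; it is kept separate so that each criterion can be tuned independently.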

build the knowledge base:
1. create a dictionary of n-grams and their probability of referring to a certain type (e.g. "r. John" -> Person (0.9), Location (0.05), Product (0.05), ...)
2. complete context information per type
3. context information within a sliding window of +3/-3 words
4. case signatures for all tokens in the training data (A => completely uppercase, Aa => capitalized, a => lowercase)
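A minimal sketch of two of these knowledge-base parts, the n-gram dictionary and the case signatures, might look as follows. It assumes character-level n-grams (the "r. John" example reads like one) and reuses the range 3 to 7 mentioned for the recognition step. The function names and the input format, a list of (text, type) pairs, are hypothetical.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n_min=3, n_max=7):
    """Yield all character n-grams of text with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def build_ngram_dictionary(annotated):
    """Map each n-gram to relative type frequencies.

    annotated: iterable of (text, entity_type) pairs (assumed input format).
    """
    counts = defaultdict(Counter)
    for text, entity_type in annotated:
        for gram in char_ngrams(text):
            counts[gram][entity_type] += 1
    dictionary = {}
    for gram, type_counts in counts.items():
        total = sum(type_counts.values())
        dictionary[gram] = {t: c / total for t, c in type_counts.items()}
    return dictionary


def case_signature(token):
    """Return 'A' (all uppercase), 'Aa' (capitalized) or 'a' (anything else)."""
    if token.isupper():
        return "A"
    if token[:1].isupper():
        return "Aa"
    return "a"
```

Relative frequencies are used here as a simple stand-in for the probabilities; the post does not say how the authors estimate them.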

When analyzing contexts, numbers are always replaced by the placeholder "NUM".
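The sliding-window context and the "NUM" placeholder could be combined roughly as shown below. This is a sketch under assumptions: tokens are whitespace-separated, only tokens that consist entirely of digits (with optional "." or "," separators) count as numbers, and the helper names are made up for illustration.

```python
import re

# Assumed definition of a number token, e.g. "1889", "3.5" or "1,000".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def normalize(tokens):
    """Replace every number token by the placeholder 'NUM'."""
    return ["NUM" if NUMBER.fullmatch(t) else t for t in tokens]


def window_context(tokens, start, end, size=3):
    """Return the (left, right) context of the entity at tokens[start:end].

    Each side holds at most `size` tokens, matching the +3/-3 word window.
    """
    left = normalize(tokens[max(0, start - size):start])
    right = normalize(tokens[end:end + size])
    return left, right
```

For the token list of "In 1889 the Eiffel Tower opened in Paris after 2 years", the entity "Eiffel Tower" (tokens 3 to 5) gets the left context ["In", "NUM", "the"] and the right context ["opened", "in", "Paris"].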

Named Entity Recognition

Entity Detection: use regular expressions (capitalization for English documents) to identify potential entities.
Classify the entities based on the knowledge base by creating n-grams (3<=n<=7) and comparing them to the data stored in the dictionaries.
Post-processing:
1. remove date fragments (since dates are not considered named entities) - e.g. July John Hiat => John Hiat
2. apply information from the context dictionary to refine the estimation (e.g. "Paris" -> Person; "born in Paris" -> City)
3. use context information to adjust the boundaries of n-grams (e.g. "President Obama" -> "Obama", since "President" occurs multiple times in the text without "Obama", which suggests that it is a title or job description).
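The detection, classification and date-fragment steps above can be sketched together. Everything below is a simplified illustration and not Palladian's actual code: the capitalization regex, the month list, and the scoring rule (summing the stored probabilities of all matching character n-grams) are assumptions.

```python
import re

# Assumed candidate pattern: runs of capitalized words, e.g. "John Hiat".
CANDIDATE = re.compile(r"\b[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)*")

# Illustrative list of date fragments; "May" is ambiguous in practice.
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def strip_date_fragments(candidate):
    """Drop month names at either end, e.g. 'July John Hiat' -> 'John Hiat'."""
    tokens = candidate.split()
    while tokens and tokens[0] in MONTHS:
        tokens.pop(0)
    while tokens and tokens[-1] in MONTHS:
        tokens.pop()
    return " ".join(tokens)


def detect_candidates(text):
    """Find capitalized spans and clean them of date fragments."""
    found = []
    for match in CANDIDATE.finditer(text):
        cleaned = strip_date_fragments(match.group())
        if cleaned:
            found.append(cleaned)
    return found


def classify(candidate, ngram_dictionary, n_min=3, n_max=7):
    """Pick the type with the highest summed n-gram probability (assumed rule)."""
    scores = {}
    for n in range(n_min, n_max + 1):
        for i in range(len(candidate) - n + 1):
            gram = candidate[i:i + n]
            for entity_type, p in ngram_dictionary.get(gram, {}).items():
                scores[entity_type] = scores.get(entity_type, 0.0) + p
    return max(scores, key=scores.get) if scores else None
```

A capitalization-only pattern also picks up sentence-initial words, which is one reason why the context-based refinement steps matter.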

Resources:

Areca - a repository for test data sets
Palladian - an NLP framework that is free for research purposes
CoNLL dataset


Albert Weichselbraun
