WISE 2011 - Training a Named Entity Recognizer on the Web
by Urbansky et al.
The authors distinguish between three approaches towards NER:
- use of hand-crafted rules (lexicons, rules)
- supervised machine learning, and
- unsupervised machine learning (e.g., clustering)
Creating Training Data
Collins and Singer show that at least seven seed terms are required for creating well-performing extraction rules. A higher number of seeds increases recall at the cost of the component's precision. The authors apply the following process for creating the training data:
- create a number of seed entities
- query a search engine to retrieve documents containing these seeds
- boilerplate removal
- annotate the seeds and remove all lines that either
  - do not contain any annotated seed
  - are shorter than 80 characters, or
  - do not contain any context around the seed
- build the knowledge base:
  - create a dictionary of n-grams and their probability of referring to a certain type (e.g. "r. John" -> Person (0.9), Location (0.05), Product (0.05), ...)
  - complete context information per type
  - context information within a sliding window of +/-3 words
  - case signatures for all tokens in the training data (A => completely uppercase, Aa => capitalized, a => lowercase)
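The training-data steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `<TYPE>...</TYPE>` tag format, and the reading of "no context" as "nothing left once the annotations are removed" are assumptions. The sketch also assumes character n-grams, which the seven-character example "r. John" suggests but the notes do not state.

```python
import re
from collections import Counter, defaultdict

MIN_LINE_LENGTH = 80  # threshold taken from the notes
TAG = re.compile(r"<(\w+)>(.*?)</\1>")  # hypothetical annotation format

def annotate(line, seeds):
    """Wrap each seed occurrence in <TYPE>...</TYPE> (seeds: name -> type)."""
    for name, etype in seeds.items():
        pattern = r"\b%s\b" % re.escape(name)
        line = re.sub(pattern, "<%s>%s</%s>" % (etype, name, etype), line)
    return line

def keep_line(line, annotated):
    """Apply the three line filters from the notes."""
    if annotated == line:            # no annotated seed
        return False
    if len(line) < MIN_LINE_LENGTH:  # too short
        return False
    # Assumption: "context" means some text remains besides the annotations.
    return bool(TAG.sub("", annotated).strip())

def char_ngrams(text, n_min=3, n_max=7):
    """All character n-grams of text with n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def build_ngram_dictionary(annotated_lines):
    """Map each n-gram to the relative frequency of the types it occurred in."""
    counts = defaultdict(Counter)
    for line in annotated_lines:
        for match in TAG.finditer(line):
            etype, name = match.group(1), match.group(2)
            for gram in char_ngrams(name):
                counts[gram][etype] += 1
    return {gram: {t: c / sum(tc.values()) for t, c in tc.items()}
            for gram, tc in counts.items()}

def context_windows(annotated_line, window=3):
    """Yield (type, left words, right words) for each annotated entity."""
    for match in TAG.finditer(annotated_line):
        left = TAG.sub(r"\2", annotated_line[:match.start()]).split()[-window:]
        right = TAG.sub(r"\2", annotated_line[match.end():]).split()[:window]
        yield match.group(1), left, right

def case_signature(token):
    """A = completely uppercase, Aa = capitalized, a = everything else."""
    if token.isupper():
        return "A"
    if token[:1].isupper():
        return "Aa"
    return "a"
```

Annotating seed by seed is the simplest option, but seeds that contain one another (for example "Paris" and "Paris Hilton") would need longest-match handling that this sketch leaves out.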
Named Entity Recognition
- Entity Detection: use regular expressions (capitalization for English documents) to identify potential entities.
- Classify the entities based on the knowledge base by creating n-grams (3<=n<=7) and comparing them to the data stored in the dictionaries.
- Post-processing
  - remove date fragments (since dates are not considered named entities) - e.g. July John Hiat => John Hiat
  - apply information from the context dictionary to refine the estimation (e.g. "Paris" -> Person; "born in Paris" -> City)
  - use context information to change the boundaries of n-grams (e.g. "President Obama" -> "Obama", since "president" occurs multiple times in the text without "Obama", which suggests that it is a title or job description).
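The recognition steps above might look roughly like this in code. It is a sketch under several assumptions rather than the authors' method: the capitalization regex, the scoring rule (summing n-gram probabilities per type), the list of date words, and the `min_standalone` threshold are all illustrative choices, and every function name is hypothetical. The context-based type refinement ("born in Paris" -> City) is not covered.

```python
import re
from collections import defaultdict

# Candidate entities: runs of capitalized words (English-only heuristic).
CANDIDATE = re.compile(r"\b[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*")

# Assumed list of date words; the notes only give "July" as an example.
DATE_WORDS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
    "Saturday", "Sunday",
}

def detect_candidates(text):
    """Return all capitalized word sequences as potential entities."""
    return CANDIDATE.findall(text)

def classify(candidate, ngram_dictionary, n_min=3, n_max=7):
    """Pick the type with the highest summed n-gram probability (assumed rule)."""
    scores = defaultdict(float)
    for n in range(n_min, n_max + 1):
        for i in range(len(candidate) - n + 1):
            gram = candidate[i:i + n]
            for etype, prob in ngram_dictionary.get(gram, {}).items():
                scores[etype] += prob
    return max(scores, key=scores.get) if scores else None

def strip_date_fragments(candidate):
    """Drop leading and trailing date words, e.g. 'July John Hiat'."""
    tokens = candidate.split()
    while tokens and tokens[0] in DATE_WORDS:
        tokens.pop(0)
    while tokens and tokens[-1] in DATE_WORDS:
        tokens.pop()
    return " ".join(tokens)

def trim_title(candidate, text, min_standalone=2):
    """Remove the first token if it also occurs on its own in the text."""
    tokens = candidate.split()
    if len(tokens) < 2:
        return candidate
    first, rest = tokens[0], " ".join(tokens[1:])
    total = len(re.findall(r"\b%s\b" % re.escape(first), text, re.IGNORECASE))
    joined = r"\b%s\s+%s\b" % (re.escape(first), re.escape(rest))
    together = len(re.findall(joined, text, re.IGNORECASE))
    return rest if total - together >= min_standalone else candidate
```

Note that the capitalization heuristic also picks up sentence-initial words such as "The", so a real pipeline would need an additional filter, for instance one based on the case signatures collected during training.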
Resources:
- Areca - a repository for test data sets
- Palladian - an NLP framework that is free for research purposes
- CoNLL dataset