WISE 2011 - Training a Named Entity Recognizer on the Web

by Urbansky et al.

The authors distinguish between three approaches to named entity recognition (NER):

  1. use of hand-crafted resources (lexicons, rules),
  2. supervised machine learning, and
  3. unsupervised machine learning (e.g., clustering)

Creating Training Data

Collins and Singer show that at least seven seed terms are required for creating well-performing extraction rules. A higher number of seeds increases recall at the cost of precision. The authors apply the following process for creating the training data:

  1. create a number of seed entities
  2. query a search engine to retrieve documents containing these seeds
  3. boilerplate removal
  4. annotate the seeds and remove all lines that either
    1. do not contain any annotated seed
    2. are shorter than 80 characters, or
    3. do not contain any context around the seed
  5. build the knowledge base:
    1. create a dictionary of n-grams and their probability of referring to a certain type (e.g., "r. John" -> Person (0.9), Location (0.05), Product (0.05), ...)
    2. complete context information per type
    3. context information within a sliding window of three words before and after the entity (+3/-3 words)
    4. case signatures for all tokens in the training data (A => completely uppercase, Aa => capitalized, a => lowercase)
When analyzing contexts, numbers are always replaced by the placeholder "NUM".
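
The knowledge-base steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the n-grams are assumed to be character-level (which the "r. John" example suggests), and the probabilities are assumed to be simple relative frequencies over the annotated seeds.

```python
import re
from collections import Counter, defaultdict

MIN_LINE_LENGTH = 80  # lines shorter than this are dropped (step 4)


def keep_line(line, seeds):
    """Keep a line only if it is long enough and contains a seed plus other text."""
    if len(line) < MIN_LINE_LENGTH:
        return False
    return any(seed in line and line.strip() != seed for seed in seeds)


def case_signature(token):
    """Map a token to A (all uppercase), Aa (capitalized) or a (anything else)."""
    if token.isupper():
        return "A"
    if token[:1].isupper():
        return "Aa"
    return "a"


def normalize_context(text):
    """Replace every number in a context string with the placeholder NUM."""
    return re.sub(r"\d+(?:[.,]\d+)*", "NUM", text)


def char_ngrams(text, n_min=3, n_max=7):
    """Yield all character n-grams of text (assumption: character-level n-grams)."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def build_ngram_dictionary(annotated_entities):
    """Map each n-gram to the relative frequency of every type it was seen with.

    annotated_entities is an iterable of (entity_text, entity_type) pairs.
    """
    counts = defaultdict(Counter)
    for entity, entity_type in annotated_entities:
        for gram in char_ngrams(entity):
            counts[gram][entity_type] += 1
    return {
        gram: {t: c / sum(type_counts.values()) for t, c in type_counts.items()}
        for gram, type_counts in counts.items()
    }
```

Calling `build_ngram_dictionary([("John Smith", "Person"), ("John Deere", "Product")])` yields an entry for the n-gram `"John"` that splits its probability evenly between `Person` and `Product`, mirroring the dictionary format shown in step 5.1.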

Named Entity Recognition

  1. Entity Detection: use regular expressions (capitalization for English documents) to identify potential entities.
  2. Classify the entities based on the knowledge base by creating n-grams (3<=n<=7) and comparing them to the data stored in the dictionaries.
  3. Post processing
    1. remove date fragments (since dates are not considered named entities), e.g. "July John Hiat" -> "John Hiat"
    2. apply information from the context dictionary to refine the estimation (e.g. "Paris" -> Person; "born in Paris" -> City)
    3. use context information to change the boundaries of n-grams (e.g. "President Obama" -> "Obama", since "President" also occurs several times in the text without "Obama", which suggests that it is a title or job description).
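
The recognition steps can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the authors' code: the regular expression, the month list, and the scoring rule (summing the per-type probabilities of all matching character n-grams) are hypothetical choices, since the summary above only says that the n-grams are compared with the dictionaries.

```python
import re
from collections import Counter

# Assumption: a candidate is a run of capitalized words (English text only).
CANDIDATE = re.compile(r"\b[A-Z][a-zA-Z]*(?:\s+[A-Z][a-zA-Z]*)*")

MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def detect_candidates(text):
    """Step 1: return all runs of capitalized words as potential entities."""
    return [match.group() for match in CANDIDATE.finditer(text)]


def classify(candidate, ngram_dictionary, n_min=3, n_max=7):
    """Step 2: score each type by summing the probabilities of matching n-grams.

    ngram_dictionary maps an n-gram to a {type: probability} dict.
    Returns the best-scoring type, or None if no n-gram is known.
    """
    scores = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(candidate) - n + 1):
            gram = candidate[i:i + n]
            for entity_type, prob in ngram_dictionary.get(gram, {}).items():
                scores[entity_type] += prob
    return scores.most_common(1)[0][0] if scores else None


def strip_date_fragments(candidate):
    """Step 3.1: drop month names at the start or end of a candidate."""
    tokens = candidate.split()
    while tokens and tokens[0] in MONTHS:
        tokens.pop(0)
    while tokens and tokens[-1] in MONTHS:
        tokens.pop()
    return " ".join(tokens)
```

For instance, `strip_date_fragments("July John Hiat")` returns `"John Hiat"`. Note that the capitalization heuristic also picks up sentence-initial words, and the month list would wrongly strip names such as "May"; the context-based post-processing steps 3.2 and 3.3 are not covered by this sketch.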