The Web is not a Person - An Analysis of the Performance of Named-Entity Recognition

Krovetz, R. et al. 2011. The web is not a person, Berners-Lee is not an organization, and African-Americans are not locations: an analysis of the performance of named-entity recognition. Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World.

This paper presents an evaluation of named-entity recognition that is based on

  1. the agreement rate between three different NER-systems, and
  2. the number of identified ambiguous entities, i.e. entities that were assigned to different NER classes (e.g. PERSON and ORGANIZATION) in the same document.

The authors show that, although the literature reports an accuracy of 85-95% for named-entity recognition, the agreement rates between the classifiers and the identified ambiguities suggest a much lower accuracy.
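
The two measures can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the input format (one dictionary of mention-to-class labels per tagger) and the choice to compute agreement only over mentions that both taggers labelled are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_agreement(outputs):
    """Agreement rate for every pair of taggers.

    `outputs` maps a tagger name to a dict {mention: NE class}.
    Assumption: agreement is measured only on mentions both taggers labelled.
    """
    rates = {}
    for a, b in combinations(sorted(outputs), 2):
        shared = outputs[a].keys() & outputs[b].keys()
        if not shared:
            rates[(a, b)] = None  # no common mentions, rate undefined
            continue
        same = sum(1 for m in shared if outputs[a][m] == outputs[b][m])
        rates[(a, b)] = same / len(shared)
    return rates


def ambiguous_entities(tagged_mentions):
    """Entities that received more than one NE class within one document.

    `tagged_mentions` is an iterable of (surface form, NE class) pairs
    produced by a single tagger for a single document.
    """
    classes = defaultdict(set)
    for surface, ne_class in tagged_mentions:
        classes[surface].add(ne_class)
    return {s: c for s, c in classes.items() if len(c) > 1}


if __name__ == "__main__":
    # Hypothetical tagger outputs, invented for this example.
    outputs = {
        "A": {"Berners-Lee": "PERSON", "Web": "PERSON"},
        "B": {"Berners-Lee": "ORGANIZATION", "Web": "PERSON"},
        "C": {"Berners-Lee": "PERSON"},
    }
    print(pairwise_agreement(outputs))
    print(ambiguous_entities([("Washington", "PERSON"),
                              ("Washington", "LOCATION"),
                              ("IBM", "ORGANIZATION")]))
```

A low pairwise rate or a long list of ambiguous entities does not say which tagger is wrong, only that at least one of the conflicting labels must be.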

Resources

  • NER taggers
    • Stanford tagger
    • LBJ tagger
    • IdentiFinder (proprietary)
  • Language resources used for training
    • English Gigaword corpus
    • Reuters 1996 news corpus
    • North American News corpus
  • Other language resources
    • ETS SourceFinder corpus
    • American National Corpus (ANC) - http://www.anc.org/annotations.html (tagged with NE)

Suggestions and Remarks

  1. The authors hypothesize that the same word is unlikely to occur with two different NE classes in one document, so conflicting labels for a word within a document most likely point to tagging errors.
  2. They suggest using grammar patterns such as "Bank of [LOCATION]" versus "[LOCATION]" to distinguish between NE classes.
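
The pattern idea in the second remark could be sketched as follows. This is only an illustration under stated assumptions: the function `classify_mentions`, the hard-coded "Bank of" prefix and the externally supplied list of location names are hypothetical and do not come from the paper.

```python
import re


def classify_mentions(text, locations):
    """Label location names, treating "Bank of <location>" as an organization.

    `locations` is a collection of known location names (an assumed input).
    Returns a list of (matched text, NE class) pairs in order of appearance.
    """
    if not locations:
        return []
    # Longest names first so that e.g. "New York" wins over "York".
    alternation = "|".join(
        re.escape(loc) for loc in sorted(locations, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b(Bank of )?({alternation})\b")
    results = []
    for match in pattern.finditer(text):
        if match.group(1):
            # The prefix turns the whole span into an organization name.
            results.append((match.group(0), "ORGANIZATION"))
        else:
            results.append((match.group(2), "LOCATION"))
    return results


if __name__ == "__main__":
    print(classify_mentions("The Bank of England is in England.", {"England"}))
```

A real system would need many more patterns than a single prefix, but the sketch shows how surrounding context can separate two classes for the same surface form.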