Krovetz, R. et al. 2011. The web is not a person, Berners-Lee is not an organization, and African-Americans are not locations: an analysis of the performance of named-entity recognition. Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World.
This paper presents an evaluation of named-entity recognition that is based on
- the agreement rate between three different NER systems, and
- the number of identified ambiguous entries, i.e. entities that were assigned to different NE classes (e.g. PERSON and ORGANIZATION) in the same document.
- NER taggers
  - Stanford tagger
  - LBJ tagger
  - IdentiFinder (proprietary)
- Language resources used for training
  - English Gigaword corpus
  - Reuters 1996 news corpus
  - North American News corpus
- Other language resources
  - ETS SourceFinder corpus
  - American National Corpus (ANC), tagged with named entities: http://www.anc.org/annotations.html
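The two evaluation measures mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the input format (one entity-to-class mapping per tagger, and a list of mentions per document) and the exact definition of "agreement" are assumptions made for illustration, and the data is a toy example rather than output of the taggers listed above.

```python
from collections import defaultdict

def agreement_rate(*tagger_outputs):
    """Share of entities found by every tagger that all taggers put in the same class.

    Each argument is a dict mapping an entity string to its NE class
    (hypothetical input format, assumed for this sketch).
    """
    shared = set(tagger_outputs[0])
    for output in tagger_outputs[1:]:
        shared &= set(output)
    if not shared:
        return 0.0
    agreed = sum(
        1 for entity in shared
        if len({output[entity] for output in tagger_outputs}) == 1
    )
    return agreed / len(shared)

def ambiguous_entities(mentions):
    """Entities that received more than one NE class within a single document.

    `mentions` is a list of (entity string, NE class) pairs for one document.
    """
    classes = defaultdict(set)
    for text, label in mentions:
        classes[text].add(label)
    return {text: labels for text, labels in classes.items() if len(labels) > 1}

# Toy data, invented for illustration only.
tagger_a = {"Berners-Lee": "PERSON", "Web": "ORGANIZATION", "Reuters": "ORGANIZATION"}
tagger_b = {"Berners-Lee": "ORGANIZATION", "Web": "ORGANIZATION", "Reuters": "ORGANIZATION"}
tagger_c = {"Berners-Lee": "PERSON", "Web": "PERSON", "Reuters": "ORGANIZATION"}
rate = agreement_rate(tagger_a, tagger_b, tagger_c)

document = [("Washington", "PERSON"), ("Washington", "LOCATION"), ("Reuters", "ORGANIZATION")]
ambiguous = ambiguous_entities(document)
```

In this toy example only "Reuters" receives the same class from all three taggers, and "Washington" is the only entity flagged as ambiguous within the document.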
Suggestions and Remarks
- the authors hypothesize that the same word is unlikely to appear with two different NE classes within a single document.
- they suggest using grammatical patterns such as "Bank of [LOCATION]" versus "[LOCATION]" to distinguish different NE classes.
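A rough sketch of how such a pattern might be applied as a post-processing step is shown below. It is an assumption about one possible realization, not the method described in the paper: the function name, the (token, tag) input format and the "O" tag for non-entities are hypothetical, and only single-token locations are handled.

```python
def apply_bank_of_pattern(tagged):
    """Merge 'Bank of <LOCATION>' into one ORGANIZATION span.

    `tagged` is a list of (token, tag) pairs (hypothetical format).
    A LOCATION that is not preceded by 'Bank of' keeps its tag.
    """
    result = []
    i = 0
    while i < len(tagged):
        matches = (
            i + 2 < len(tagged)
            and tagged[i][0] == "Bank"
            and tagged[i + 1][0] == "of"
            and tagged[i + 2][1] == "LOCATION"
        )
        if matches:
            # Join the three tokens and relabel the whole span.
            span = " ".join(token for token, _ in tagged[i:i + 3])
            result.append((span, "ORGANIZATION"))
            i += 3
        else:
            result.append(tagged[i])
            i += 1
    return result

# Toy sentence, invented for illustration only.
sentence = [
    ("Bank", "O"), ("of", "O"), ("England", "LOCATION"),
    ("is", "O"), ("in", "O"), ("England", "LOCATION"),
]
relabeled = apply_bank_of_pattern(sentence)
```

Here the first "England" becomes part of the ORGANIZATION span "Bank of England", while the second, standalone "England" remains a LOCATION.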