Improving Efficiency and Accuracy in Multilingual Entity Extraction
Introduction
Method
Spotlight uses a two-step approach to entity extraction: (i) phrase spotting (i.e., recognition of phrases to be annotated) and (ii) disambiguation (i.e., entity linking against DBpedia).
Phrase Spotting
- identify candidate phrases using one of the following two approaches:
- language-independent (lexical): substring-matching (Aho-Corasick algorithm)
- language-dependent: extract (i) capitalized tokens, (ii) noun phrases, prepositional phrases and multi-word units (MWU), and (iii) named entities
- candidate selection: resolve overlapping candidates by a fixed preference order (NE > ORG > LOC > MISC > NP > MWU > PP > lexical lookup > Capitalized Sequence)
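The spotting steps can be illustrated with a minimal sketch. It is not Spotlight's code and does not use LingPipe: it is a plain Python rendering of Aho-Corasick substring matching followed by preference-based overlap resolution. The names `Candidate`, `build_automaton`, `lexical_lookup`, and `select_candidates` are hypothetical, and breaking ties within one source by preferring the longer span is an assumption not stated in the notes.

```python
from collections import deque
from typing import NamedTuple

# Preference order taken from the candidate-selection step; lower index wins.
PREFERENCE = ["NE", "ORG", "LOC", "MISC", "NP", "MWU", "PP",
              "lexical lookup", "Capitalized Sequence"]
RANK = {source: i for i, source in enumerate(PREFERENCE)}


class Candidate(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    source: str  # one of PREFERENCE


def build_automaton(phrases):
    """Build Aho-Corasick goto, failure, and output tables for the phrases."""
    goto, fail, out = [{}], [0], [[]]
    for phrase in dict.fromkeys(phrases):  # drop duplicates, keep order
        if not phrase:
            continue  # an empty phrase would match at every position
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(phrase)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit shorter matches
    return goto, fail, out


def lexical_lookup(text, automaton):
    """Return every lexicon phrase occurring in text, in one left-to-right pass."""
    goto, fail, out = automaton
    state, found = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for phrase in out[state]:
            found.append(Candidate(i + 1 - len(phrase), i + 1, "lexical lookup"))
    return found


def select_candidates(candidates):
    """Keep the most preferred candidate among overlapping ones."""
    # Assumption: within one source, longer spans are tried first.
    ordered = sorted(candidates,
                     key=lambda c: (RANK[c.source], c.start - c.end, c.start))
    kept = []
    for cand in ordered:
        if all(cand.end <= k.start or k.end <= cand.start for k in kept):
            kept.append(cand)
    return sorted(kept)


text = "I love New York City."
lexical = lexical_lookup(text, build_automaton(["New York", "York", "New York City"]))
ner = [Candidate(7, 15, "LOC")]  # hypothetical output of a named entity recognizer
print(select_candidates(lexical + ner))
```

In this example the lexical lookup finds three overlapping spans ("New York", "York", "New York City"), and the hypothetical LOC candidate displaces all of them because LOC ranks above lexical lookup. Note that the matcher works on raw characters, so a real spotter would additionally need to respect token boundaries.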
Disambiguation
Disambiguation is based on the probabilistic model of Han and Sun (2011), which uses raw counts of phrases (s), contexts (c), and entities (e) from Wikipedia to compute the probabilities P(e), P(s|e), and P(c|e). Phrases whose probability is lower than that of an artificial NIL entry are removed.
Technologies Used
- fast serialization: Kryo
- substring search: LingPipe's Aho-Corasick implementation
- tokenization and named entity recognition: Apache OpenNLP
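To make the disambiguation step more concrete, the following is a minimal sketch of count-based scoring in the spirit of the model described under Disambiguation. It ranks the candidate entities of a phrase by log P(e) + log P(s|e) + log P(c|e) and returns nothing when the best score falls below a NIL score. Several parts are assumptions rather than details from the notes: the class `CountModel` and its methods are hypothetical, the context is treated as a bag of independent tokens, additive smoothing with `alpha` stands in for whatever smoothing the original model uses, and the NIL score is simply passed in by the caller.

```python
import math
from collections import Counter


class CountModel:
    """Raw counts of entities, (phrase, entity) pairs, and context tokens."""

    def __init__(self, annotations):
        # annotations: iterable of (entity, phrase, context_tokens) triples,
        # e.g. collected from links in Wikipedia articles.
        self.entity_count = Counter()
        self.phrase_entity_count = Counter()
        self.context_count = {}
        self.vocab = set()
        for entity, phrase, tokens in annotations:
            self.entity_count[entity] += 1
            self.phrase_entity_count[(phrase, entity)] += 1
            self.context_count.setdefault(entity, Counter()).update(tokens)
            self.vocab.update(tokens)
        self.total = sum(self.entity_count.values())

    def log_score(self, entity, phrase, tokens, alpha=1.0):
        """Return log P(e) + log P(s|e) + sum over tokens of log P(t|e)."""
        pair = self.phrase_entity_count[(phrase, entity)]
        if pair == 0:
            return float("-inf")  # phrase never refers to this entity
        n_e = self.entity_count[entity]
        score = math.log(n_e / self.total)  # log P(e)
        score += math.log(pair / n_e)       # log P(s|e)
        ctx = self.context_count[entity]
        # Assumption: additive smoothing, with one extra slot for unseen tokens.
        denom = sum(ctx.values()) + alpha * (len(self.vocab) + 1)
        for token in tokens:
            score += math.log((ctx[token] + alpha) / denom)  # log P(t|e)
        return score

    def disambiguate(self, phrase, tokens, nil_log_score):
        """Return the best-scoring entity, or None if NIL scores higher."""
        scores = {e: self.log_score(e, phrase, tokens)
                  for (s, e) in self.phrase_entity_count if s == phrase}
        if not scores:
            return None
        best = max(scores, key=scores.get)
        return best if scores[best] >= nil_log_score else None


model = CountModel([
    ("Washington,_D.C.", "Washington", ["capital", "city", "congress"]),
    ("Washington,_D.C.", "Washington", ["city", "white", "house"]),
    ("George_Washington", "Washington", ["president", "general", "army"]),
])
print(model.disambiguate("Washington", ["president", "army"], nil_log_score=-20.0))
```

With these toy counts the context tokens outweigh the higher prior of the city, so the phrase resolves to the person. Raising `nil_log_score` above the best candidate's score makes the phrase drop out, which mirrors the removal of phrases that score below the NIL entry.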