Daiber, J. et al., 2013. Improving Efficiency and Accuracy in Multilingual Entity Extraction. In Proceedings of the 9th International Conference on Semantic Systems (I-SEMANTICS '13). pp. 121–124.
Introduction
This paper focuses on the implementation and data processing challenges encountered while developing a faster and more accurate version of DBpedia Spotlight.
Method
Spotlight extracts entities in two steps: (i) phrase spotting (i.e. recognition of the phrases to be annotated) and (ii) disambiguation (i.e. linking each phrase to an entity in DBpedia).
Phrase Spotting
- identify candidate phrases using one of the following two approaches:
  - language-independent (lexical): substring matching (Aho-Corasick algorithm)
  - language-dependent: extract (i) capitalized tokens, (ii) noun phrases, prepositional phrases and multi-word units (MWU), and (iii) named entities
- candidate selection: resolve overlaps based on a preference-based choice (NE > ORG > LOC > MISC > NP > MWU > PP > lexical lookup > Capitalized Sequence)
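The candidate selection step can be illustrated with a minimal sketch. The preference order is copied from the list above; everything else is an assumption made for illustration, since these notes do not specify it: the function name `resolve_overlaps`, the `(start, end, source)` candidate tuples, the greedy pass, and the tie-break that prefers longer spans among candidates of the same type.

```python
# Preference order as given in the notes; earlier labels win over later ones.
PREFERENCE = ["NE", "ORG", "LOC", "MISC", "NP", "MWU", "PP",
              "lexical lookup", "Capitalized Sequence"]
RANK = {label: i for i, label in enumerate(PREFERENCE)}


def resolve_overlaps(candidates):
    """Keep a non-overlapping subset of candidate spans.

    candidates: list of (start, end, source) tuples, end exclusive,
    where source is one of the labels in PREFERENCE.
    """
    selected = []
    # Greedy pass (assumption): visit candidates from most to least preferred.
    # Ties are broken by longer span, then earlier start (also an assumption).
    ordered = sorted(candidates,
                     key=lambda c: (RANK[c[2]], -(c[1] - c[0]), c[0]))
    for start, end, source in ordered:
        # Accept the candidate only if it overlaps nothing already kept.
        if all(end <= s or start >= e for s, e, _ in selected):
            selected.append((start, end, source))
    # Return spans in document order.
    return sorted(selected)


candidates = [
    (0, 2, "NP"),
    (1, 3, "ORG"),
    (4, 5, "Capitalized Sequence"),
    (4, 6, "lexical lookup"),
]
result = resolve_overlaps(candidates)
```

In this example the ORG span displaces the overlapping NP span, and the lexical lookup span displaces the overlapping capitalized sequence.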
Disambiguation
Disambiguation is based on the probabilistic model of Han and Sun (2011), which estimates the probabilities P(e), P(s|e) and P(c|e) from raw Wikipedia counts of phrases (s), contexts (c) and entities (e). Phrases whose probability is lower than that of an artificial NIL entry are removed.
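As a rough sketch of how such a count-based score might be computed, the following combines log P(e), log P(s|e) and log P(c|e) for each candidate entity and rejects the phrase if the best score falls below a NIL score. This is not the paper's implementation. The class `CountModel`, the function `disambiguate`, the additive smoothing of context tokens, the fixed `nil_log_score` threshold standing in for the NIL entry, and the toy counts are all assumptions for illustration.

```python
import math
from collections import Counter


class CountModel:
    """Toy count store; all counts below are invented for the example."""

    def __init__(self, entity_counts, phrase_entity_counts, context_counts,
                 alpha=1.0):
        self.entity_counts = entity_counts                # count(e)
        self.phrase_entity_counts = phrase_entity_counts  # count(s, e)
        self.context_counts = context_counts              # e -> token counts
        self.alpha = alpha                                # smoothing (assumption)
        self.total = sum(entity_counts.values())
        self.vocab = {t for c in context_counts.values() for t in c}

    def log_score(self, entity, phrase, context):
        """log P(e) + log P(s|e) + log P(c|e), estimated from raw counts."""
        n_e = self.entity_counts[entity]
        n_se = self.phrase_entity_counts.get((phrase, entity), 0)
        if n_se == 0:
            return float("-inf")  # phrase never observed for this entity
        score = math.log(n_e / self.total) + math.log(n_se / n_e)
        tokens = self.context_counts.get(entity, Counter())
        denom = sum(tokens.values()) + self.alpha * len(self.vocab)
        for t in context:
            # Additive smoothing so unseen context tokens do not zero the score.
            score += math.log((tokens.get(t, 0) + self.alpha) / denom)
        return score


def disambiguate(model, phrase, context, candidates, nil_log_score):
    """Return the best-scoring entity, or None if NIL scores higher."""
    if not candidates:
        return None
    best_score, best = max(
        (model.log_score(e, phrase, context), e) for e in candidates)
    return best if best_score > nil_log_score else None


model = CountModel(
    entity_counts={"Washington,_D.C.": 60, "George_Washington": 40},
    phrase_entity_counts={("Washington", "Washington,_D.C."): 30,
                          ("Washington", "George_Washington"): 20},
    context_counts={"Washington,_D.C.": Counter(city=8, capital=2),
                    "George_Washington": Counter(president=9, general=1)},
)
cands = ["Washington,_D.C.", "George_Washington"]
linked = disambiguate(model, "Washington", ["president"], cands,
                      nil_log_score=math.log(1e-6))
```

With these toy counts, the context token "president" outweighs the higher prior of the city entity, and raising `nil_log_score` far enough causes the phrase to be dropped instead of linked.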
Technologies Used
- fast serialization: Kryo
- substring search: LingPipe's Aho-Corasick implementation
- tokenization and named entity recognition: Apache OpenNLP