Improving Efficiency and Accuracy in Multilingual Entity Extraction

Daiber, J. et al., 2013. Improving Efficiency and Accuracy in Multilingual Entity Extraction. In Proceedings of the 9th International Conference on Semantic Systems (I-SEMANTICS '13), pp. 121–124.


This paper focuses on the implementation and data processing challenges encountered while developing a faster and more accurate version of DBpedia Spotlight.


Spotlight uses a two-step approach to entity extraction: (i) phrase spotting (i.e. recognition of phrases to be annotated) and (ii) disambiguation (i.e. entity linking against DBpedia).

Phrase Spotting

  1. identify candidate phrases using one of the following two approaches:
    • language-independent (lexical): substring matching (Aho-Corasick algorithm)
    • language-dependent: extract (i) capitalized tokens, (ii) noun phrases, prepositional phrases and multi-word units (MWU), and (iii) named entities

  2. candidate selection: resolve overlapping candidates using a fixed preference order (NE > ORG > LOC > MISC > NP > MWU > PP > lexical lookup > Capitalized Sequence)
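As a rough illustration, the overlap resolution in step 2 can be sketched as a greedy pass over candidates ordered by the preference list. This is only a sketch of the idea, not the paper's implementation; the span data and the shorthand labels for the last two preference classes are invented.

```python
# Preference order from the paper, highest first ("lexical" and
# "capitalized" are shorthand for lexical lookup / Capitalized Sequence).
PREFERENCE = ["NE", "ORG", "LOC", "MISC", "NP", "MWU", "PP",
              "lexical", "capitalized"]
RANK = {label: i for i, label in enumerate(PREFERENCE)}

def overlaps(a, b):
    """Two (start, end) spans overlap if neither ends before the other starts."""
    return a[0] < b[1] and b[0] < a[1]

def select_candidates(candidates):
    """candidates: list of (start, end, label) tuples.
    Greedily keep the most preferred candidate in each overlapping group."""
    # Consider candidates in preference order, best first.
    ordered = sorted(candidates, key=lambda c: RANK[c[2]])
    kept = []
    for cand in ordered:
        if not any(overlaps(cand[:2], k[:2]) for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c[0])

# Invented example: an NE beats the NP that overlaps it.
spans = [(0, 10, "NP"), (0, 5, "NE"), (6, 10, "lexical"), (12, 20, "MWU")]
print(select_candidates(spans))
# → [(0, 5, 'NE'), (6, 10, 'lexical'), (12, 20, 'MWU')]
```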


Disambiguation

Disambiguation is based on a probabilistic model from Han and Sun (2011), which uses raw counts of phrases (s), contexts (c) and entities (e) from Wikipedia to estimate the probabilities P(e), P(s|e) and P(c|e). Phrases with a lower probability than an artificial NIL entry are removed.
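A minimal sketch of how such a count-based model could score candidates, assuming each entity is scored by the product P(e) · P(s|e) · P(c|e) estimated from raw counts. All counts, entity names and the single-word context are invented for illustration; this is not the authors' implementation.

```python
from collections import Counter

# Invented Wikipedia-style counts, including an artificial NIL entry.
entity_count = Counter({"Berlin_(city)": 900, "Berlin_(band)": 100, "NIL": 500})
phrase_given_entity = {("Berlin", "Berlin_(city)"): 800,
                       ("Berlin", "Berlin_(band)"): 90,
                       ("Berlin", "NIL"): 50}
context_given_entity = {("capital", "Berlin_(city)"): 300,
                        ("capital", "Berlin_(band)"): 2,
                        ("capital", "NIL"): 10}
total = sum(entity_count.values())

def score(entity, phrase, context_word):
    """P(e) * P(s|e) * P(c|e), all estimated from raw counts."""
    p_e = entity_count[entity] / total
    p_s_e = phrase_given_entity.get((phrase, entity), 0) / entity_count[entity]
    p_c_e = context_given_entity.get((context_word, entity), 0) / entity_count[entity]
    return p_e * p_s_e * p_c_e

def disambiguate(phrase, context_word, entities):
    """Pick the best-scoring entity; a win for NIL discards the phrase."""
    best = max(entities, key=lambda e: score(e, phrase, context_word))
    return None if best == "NIL" else best

print(disambiguate("Berlin", "capital",
                   ["Berlin_(city)", "Berlin_(band)", "NIL"]))
# → Berlin_(city)
```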

Technologies Used

  1. fast serialization: kryo
  2. substring search: LingPipe's Aho-Corasick implementation
  3. tokenization and named entity recognition: Apache OpenNLP
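For a sense of what the lexical spotting step does, here is a toy Aho-Corasick matcher. The paper uses LingPipe's Java implementation; this Python version is only a simplified stand-in, with an invented dictionary and input.

```python
from collections import deque

def build_automaton(patterns):
    """Build the goto/fail/output tables for a set of string patterns."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(pat)
    # BFS to compute failure links; depth-1 nodes keep fail = 0 (root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out

def find_all(text, patterns):
    """Return sorted (start, pattern) pairs for every dictionary hit in text."""
    goto, fail, out = build_automaton(patterns)
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return sorted(matches)

print(find_all("new york city", ["new york", "york", "city"]))
# → [(0, 'new york'), (4, 'york'), (9, 'city')]
```

A single left-to-right scan finds all (possibly overlapping) dictionary phrases, which is why this lexical approach is language-independent: it needs only a phrase list, no linguistic preprocessing.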