Improving Efficiency and Accuracy in Multilingual Entity Extraction

Daiber, J. et al., 2013. Improving Efficiency and Accuracy in Multilingual Entity Extraction. In Proceedings of the 9th International Conference on Semantic Systems (I-SEMANTICS '13). pp. 121-124.

Introduction

This paper focuses on the implementation and data processing challenges encountered while developing a faster and more accurate version of DBpedia Spotlight.

Method

Spotlight uses a two-step approach to entity extraction: (i) phrase spotting (i.e., recognition of the phrases to be annotated) and (ii) disambiguation (i.e., entity linking against DBpedia).

Phrase Spotting

Identify candidate phrases using one of the following two approaches:

language-independent (lexical): substring-matching (Aho-Corasick algorithm)
language-dependent: extract (i) capitalized tokens, (ii) noun phrases, prepositional phrases and multi-word units (MWU), and (iii) named entities
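
As a rough illustration of the language-independent approach, the sketch below implements plain character-level Aho-Corasick matching in Python. It is not LingPipe's implementation or Spotlight's own code: the function names `build_automaton` and `spot` are hypothetical, matching is case-sensitive, and no token-boundary check is applied.

```python
from collections import deque


def build_automaton(phrases):
    """Build goto, failure and output tables for a list of surface forms."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # phrases that end at each state
    for phrase in phrases:
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(phrase)

    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A match at the failure target is also a match here.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def spot(text, automaton):
    """Return (start, end, phrase) for every dictionary phrase found in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for phrase in out[state]:
            matches.append((i - len(phrase) + 1, i + 1, phrase))
    return matches


automaton = build_automaton(["Berlin", "New York", "York", "New York City"])
for match in spot("From Berlin to New York City", automaton):
    print(match)
```

The scan reads the text once and reports all matches, including overlapping ones such as "York" inside "New York". This is why a separate selection step is needed afterwards.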

candidate selection: resolve overlapping candidates by preference (NE > ORG > LOC > MISC > NP > MWU > PP > lexical lookup > Capitalized Sequence)
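
One way to read this preference order is as a greedy selection over overlapping spans. The Python sketch below takes that reading. The label strings, the function name `resolve_overlaps` and the tie-breaking rule (longer span first, then leftmost) are assumptions for illustration and are not taken from the paper.

```python
# Preference order from the summary above, highest priority first.
# The label strings themselves are hypothetical.
PREFERENCE = ["NE", "ORG", "LOC", "MISC", "NP", "MWU", "PP", "lexical", "capitalized"]
RANK = {label: position for position, label in enumerate(PREFERENCE)}


def resolve_overlaps(candidates):
    """Pick non-overlapping spans, preferring higher-ranked sources.

    candidates: list of (start, end, source) tuples with an exclusive end.
    """
    # Assumed tie-break: within one source, prefer longer, then leftmost spans.
    ordered = sorted(candidates, key=lambda c: (RANK[c[2]], -(c[1] - c[0]), c[0]))
    selected = []
    for start, end, source in ordered:
        # Keep the span only if it does not overlap anything already chosen.
        if all(end <= s or start >= e for s, e, _ in selected):
            selected.append((start, end, source))
    return sorted(selected)


spans = [
    (0, 8, "lexical"),
    (0, 13, "NE"),
    (4, 8, "capitalized"),
    (20, 26, "NP"),
    (22, 30, "PP"),
]
print(resolve_overlaps(spans))
```

In this example the named entity span wins over the lexical and capitalized spans it overlaps, and the noun phrase wins over the overlapping prepositional phrase.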

Disambiguation

Disambiguation is based on the probabilistic model of Han and Sun (2011), which uses raw Wikipedia counts of phrases (s), contexts (c) and entities (e) to compute the probabilities P(e), P(s|e) and P(c|e). Phrases with a lower probability than an artificial NIL entry are removed.
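
To make the scoring concrete, the Python sketch below ranks candidate entities by log P(e) + log P(s|e) + log P(c|e), treating the context as a bag of tokens. Several parts are assumptions rather than details from the paper: the add-one smoothing, the layout of the `model` dictionary, the function names, and passing the NIL score in as a plain number. The counts in `TOY_MODEL` are invented for the example and are not Wikipedia statistics.

```python
import math


def smoothed(count, total, size, alpha=1.0):
    """Add-alpha smoothed relative frequency (smoothing is an assumption here)."""
    return (count + alpha) / (total + alpha * size)


def log_score(entity, phrase, context, model):
    """log P(e) + log P(s|e) + sum over context tokens of log P(t|e)."""
    stats = model["entities"][entity]
    log_p_e = math.log(stats["count"] / model["total"])
    log_p_s = math.log(
        smoothed(stats["phrases"].get(phrase, 0), stats["count"], model["num_phrases"])
    )
    token_total = sum(stats["tokens"].values())
    log_p_c = sum(
        math.log(smoothed(stats["tokens"].get(t, 0), token_total, model["vocab_size"]))
        for t in context
    )
    return log_p_e + log_p_s + log_p_c


def disambiguate(phrase, context, candidates, model, nil_log_score):
    """Return the best-scoring entity, or None if it scores below the NIL entry."""
    best = max(candidates, key=lambda e: log_score(e, phrase, context, model))
    if log_score(best, phrase, context, model) < nil_log_score:
        return None
    return best


# Invented toy counts, for illustration only.
TOY_MODEL = {
    "total": 1000,
    "num_phrases": 2,
    "vocab_size": 6,
    "entities": {
        "Berlin": {
            "count": 900,
            "phrases": {"Berlin": 800},
            "tokens": {"capital": 50, "germany": 80, "city": 70},
        },
        "Berlin_(band)": {
            "count": 100,
            "phrases": {"Berlin": 90},
            "tokens": {"album": 40, "band": 50, "capital": 1},
        },
    },
}

print(disambiguate("Berlin", ["capital", "germany"],
                   ["Berlin", "Berlin_(band)"], TOY_MODEL, nil_log_score=-50.0))
```

Working in log space avoids numeric underflow when many small context probabilities are multiplied. Raising `nil_log_score` makes the sketch discard more phrases.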

Technologies Used

fast serialization: Kryo
substring search: LingPipe's Aho-Corasick implementation
tokenization and named entity recognition: Apache OpenNLP

Albert Weichselbraun
