Mendes, P., Daiber, J., Rajapakse, R., Sasaki, F., & Bizer, C. (2012). Evaluating the Impact of Phrase Recognition on Concept Tagging. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey.
Introduction
This paper introduces DBpedia Spotlight - a concept tagging system which annotates entities, topics and other terms in text documents by (i) recognizing phrases, and (ii) linking them to DBpedia.
In contrast to related tasks such as entity linking, automatic term recognition, and targeted word sense disambiguation, concept tagging does not fix the annotation focus in advance but adapts to the style of annotation required for a particular task.
Phrase Recognition
This task provides candidate phrases for disambiguation and tagging. Its goal is to reduce the number of false positives (i.e., phrases which cannot or should not be grounded) without missing legitimate phrases. The paper distinguishes between the following approaches:
- Lexicon-based (i.e., string matching) using the Aho-Corasick algorithm.
- Noun-phrase chunk heuristic (only allow phrases which contain at least one noun).
- Noun-phrase chunking with probabilistic dictionaries (use an NP chunker to extract noun phrases and select the longest ones that occur in the search dictionary, which is encoded as a Bloom filter to reduce memory consumption).
- Detection of common words (i.e., words which should not be tagged) by using (i) a classifier for single-word phrases and (ii) a classifier for n-word phrases.
- Keyphrase extraction using Kea (Frank et al., 1999).
- Named entity recognition to pre-tag potentially interesting phrases.
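To make the lexicon-based approach more concrete, the following is a minimal sketch of Aho-Corasick string matching in plain Python. It is not Spotlight's implementation: the function names build_automaton and spot are hypothetical, the lexicon is a toy example, and matching is done per character and case-sensitively without any check for token boundaries.

```python
from collections import deque


def build_automaton(phrases):
    """Build a trie with failure links (Aho-Corasick) over the lexicon."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> lexicon phrases that end at this node
    for phrase in phrases:
        node = 0
        for ch in phrase:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(phrase)

    # Breadth-first traversal so that shallower failure targets are final
    # before deeper nodes inherit their outputs.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def spot(text, automaton):
    """Return (start_offset, phrase) for every lexicon phrase found in text."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for phrase in out[node]:
            matches.append((i - len(phrase) + 1, phrase))
    return matches


# Toy lexicon (illustrative only, not taken from the paper).
lexicon = ["Berlin", "Berlin Wall", "Wall"]
print(spot("The Berlin Wall fell.", build_automaton(lexicon)))
# [(4, 'Berlin'), (4, 'Berlin Wall'), (11, 'Wall')]
```

The sketch reports every overlapping match in a single pass over the text, which illustrates why a pure lexicon lookup tends to over-generate candidates and why the other approaches in the list try to filter them.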
Disambiguation
Spotlight obtains a disambiguation accuracy of 82.54% using a vector space model that weights terms by their Inverse Candidate Frequency (Mendes et al., 2011).
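The idea behind this weighting can be sketched as follows. The sketch assumes that the Inverse Candidate Frequency of a word w for a given phrase is log(|R| / n(w)), where R is the set of candidate resources for the phrase and n(w) is the number of those candidates whose context contains w, and that candidates are ranked by cosine similarity. The function names, the candidate contexts, and the example words are hypothetical and only serve as illustration; they are not data from the paper.

```python
import math
from collections import Counter


def icf_weights(candidate_contexts):
    """ICF(w) = log(|R| / n(w)), computed over the candidates of one phrase."""
    n_candidates = len(candidate_contexts)
    candidate_freq = Counter()
    for words in candidate_contexts.values():
        candidate_freq.update(set(words))
    return {w: math.log(n_candidates / n) for w, n in candidate_freq.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def disambiguate(context_words, candidate_contexts):
    """Rank candidates by TF*ICF cosine similarity to the phrase's context."""
    icf = icf_weights(candidate_contexts)

    def vectorize(words):
        tf = Counter(words)
        # Words unseen in any candidate context get weight 0.
        return {w: tf[w] * icf.get(w, 0.0) for w in tf}

    query = vectorize(context_words)
    scores = {
        cand: cosine(query, vectorize(words))
        for cand, words in candidate_contexts.items()
    }
    return max(scores, key=scores.get), scores


# Toy candidates for the phrase "Washington" (illustrative only).
candidates = {
    "George_Washington": ["president", "general", "american", "army"],
    "Washington,_D.C.": ["capital", "city", "american", "district"],
    "Washington_(state)": ["state", "pacific", "american", "seattle"],
}
best, scores = disambiguate(["the", "american", "capital", "city"], candidates)
print(best)  # Washington,_D.C.
```

In this toy example the word "american" occurs in the context of all three candidates, so its ICF is log(3/3) = 0 and it does not influence the ranking, whereas "capital" and "city" occur for only one candidate and therefore carry the decision.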