Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora
by de Knijff et al.
This paper covers a framework that
- extracts terms from Web corpora,
- uses word sense disambiguation (WSD) to determine the intended sense of each extracted term, and
- applies subsumption to arrange the extracted concepts in a hierarchy.
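The notes do not spell out the subsumption test, so the following is only a minimal sketch under a stated assumption: it uses a common document co-occurrence formulation in which concept x subsumes concept y when x appears in nearly every document that contains y, but not the other way round. The function name `subsumes`, the threshold `t`, and the toy documents are all hypothetical.

```python
def subsumes(x, y, docs, t=0.8):
    """Return True if concept x subsumes (is a parent of) concept y.

    docs is a list of sets, each holding the concepts found in one document.
    Assumed test: P(x | y) >= t and P(y | x) < 1.
    """
    with_x = sum(1 for d in docs if x in d)
    with_y = sum(1 for d in docs if y in d)
    both = sum(1 for d in docs if x in d and y in d)
    if with_x == 0 or with_y == 0:
        return False  # no evidence for one of the concepts
    p_x_given_y = both / with_y
    p_y_given_x = both / with_x
    return p_x_given_y >= t and p_y_given_x < 1.0


# Hypothetical toy corpus: "animal" appears wherever "dog" does, but not vice versa.
docs = [
    {"animal", "dog"},
    {"animal", "cat"},
    {"animal", "dog"},
    {"animal"},
]
is_parent = subsumes("animal", "dog", docs)
is_child = subsumes("dog", "animal", docs)
```

Applying the test to every ordered pair of concepts yields parent-child candidates from which a hierarchy can be assembled.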
Methods for Word Sense Disambiguation
- Resnik's similarity: computes the similarity between terms based on the degree of information they share
- Jiang and Conrath's similarity measure: more accurate; takes into account the information content of
  - the lowest common subsumer, and
  - the terms themselves
- SSI (Structural Semantic Interconnections), a graph connectivity measure
  - retrieve all possible senses from WordNet
  - select the sense as $$sense_t = \arg\max_{s_i \in S_t} \sum_{c_j \in C_t} sim(s_i, c_j)$$ where $$S_t$$ is the set of possible senses for term $$t$$ and $$C_t$$ is the set of context senses.
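As an illustration of the sense selection rule, here is a minimal sketch that scores each candidate sense by its summed similarity to the context senses and keeps the best one. It does not use WordNet: the taxonomy `PARENT`, the probabilities `PROB`, and the sense labels are hypothetical toy data. The two similarity functions follow the usual information-content definitions (Resnik: information content of the lowest common subsumer; Jiang and Conrath: inverse of the distance IC(a) + IC(b) - 2 IC(lcs)).

```python
import math

# Hypothetical toy taxonomy: concept -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "bat#mammal": "animal",
    "cat#feline": "animal",
    "bat#club": "artifact",
    "ball#toy": "artifact",
}

# Hypothetical probabilities of meeting a concept (or a descendant) in a corpus.
PROB = {
    "entity": 1.0,
    "animal": 0.4,
    "artifact": 0.6,
    "bat#mammal": 0.1,
    "cat#feline": 0.3,
    "bat#club": 0.2,
    "ball#toy": 0.4,
}


def ic(concept):
    """Information content: -log p(concept)."""
    return -math.log(PROB[concept])


def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lcs(a, b):
    """Lowest common subsumer: the deepest ancestor shared by a and b."""
    shared = set(ancestors(b))
    for concept in ancestors(a):
        if concept in shared:
            return concept
    raise ValueError("concepts share no ancestor")


def resnik(a, b):
    return ic(lcs(a, b))


def jiang_conrath(a, b):
    distance = ic(a) + ic(b) - 2 * ic(lcs(a, b))
    return math.inf if distance == 0 else 1.0 / distance


def select_sense(candidates, context, sim):
    """Pick the candidate with the highest summed similarity to the context."""
    return max(candidates, key=lambda s: sum(sim(s, c) for c in context))


bat_senses = ["bat#mammal", "bat#club"]
near_cat = select_sense(bat_senses, ["cat#feline"], jiang_conrath)
near_ball = select_sense(bat_senses, ["ball#toy"], resnik)
```

With this toy data, a context about cats favours the animal sense of "bat", while a context about balls favours the club sense.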
Term Filtering
The authors select the most relevant terms by considering the following measures:
- Domain pertinence - how relevant a term is for a domain: \[ DP_{D_i}(t) = \frac{freq(t/D_i)}{\max_j(freq(t/D_j))} \]
- Lexical cohesion - cohesion among words (compare: significant phrase detection)
- Domain consensus - judges how important a term is based on the number of its occurrences
- Structural relevance - add extra points for emphasized and title terms.
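The domain pertinence formula can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes that freq(t/D) is the raw count of term t in the tokens of domain D and that the maximum runs over all domains, including the target one, so the score lies between 0 and 1. The function name and the toy corpora are hypothetical.

```python
from collections import Counter


def domain_pertinence(term, target, domain_tokens):
    """Frequency of term in the target domain divided by its highest
    frequency in any domain (assumption: raw counts, all domains)."""
    freqs = {
        name: Counter(tokens)[term]
        for name, tokens in domain_tokens.items()
    }
    highest = max(freqs.values())
    if highest == 0:
        return 0.0  # the term occurs in no domain
    return freqs[target] / highest


# Hypothetical toy corpora, already tokenized.
domains = {
    "finance": ["bank", "loan", "bank", "rate"],
    "nature": ["river", "bank", "tree"],
}
dp_finance = domain_pertinence("bank", "finance", domains)
dp_nature = domain_pertinence("bank", "nature", domains)
```

A term that is equally frequent everywhere scores 1 in every domain, which is why the paper combines this measure with the others listed above rather than using it alone.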