Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

by de Knijff et al.

This paper covers a framework that

  1. extracts terms from Web corpora,
  2. uses word sense disambiguation (WSD) to determine the intended sense of each extracted term, and
  3. applies subsumption to arrange the extracted concepts in a hierarchy.
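
The subsumption step can be illustrated with a minimal sketch. It assumes a document co-occurrence formulation in which concept x subsumes concept y if P(x | y) >= t and P(y | x) < t. The threshold `t = 0.8`, the function names, and the toy documents are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Set, Tuple


def cooccurrence_prob(x: str, y: str, docs: List[Set[str]]) -> float:
    """Estimate P(x | y): the share of documents containing y that also contain x."""
    with_y = [d for d in docs if y in d]
    if not with_y:
        return 0.0
    return sum(1 for d in with_y if x in d) / len(with_y)


def subsumes(x: str, y: str, docs: List[Set[str]], t: float = 0.8) -> bool:
    """Assumed rule: x subsumes y if P(x | y) >= t and P(y | x) < t."""
    return cooccurrence_prob(x, y, docs) >= t and cooccurrence_prob(y, x, docs) < t


def build_hierarchy(concepts: List[str], docs: List[Set[str]], t: float = 0.8) -> List[Tuple[str, str]]:
    """Return (parent, child) pairs for every subsumption relation that holds."""
    return [
        (x, y)
        for x in concepts
        for y in concepts
        if x != y and subsumes(x, y, docs, t)
    ]


# Toy corpus: each document is reduced to the set of concepts it mentions.
docs = [
    {"economics", "inflation"},
    {"economics", "inflation"},
    {"economics", "trade"},
    {"economics"},
]
pairs = build_hierarchy(["economics", "inflation", "trade"], docs)
print(pairs)
```

In this toy corpus every document that mentions "inflation" or "trade" also mentions "economics", but not the other way round, so "economics" ends up as the parent of both.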

Methods for Word Sense Disambiguation

  1. Resnik's similarity: computes the similarity between two terms as the amount of information they share, i.e. the information content of their lowest common subsumer
  2. Jiang and Conrath's similarity measure: more accurate than Resnik's measure; takes into account the information content of
    1. the lowest common subsumer, and
    2. the terms themselves

  3. SSI (Structural Semantic Interconnections), a graph connectivity measure
    1. retrieve all possible senses from WordNet
    2. select the sense that maximizes the summed similarity to the context senses: $$sense_t = \arg\max_{s_i \in S_t} \sum_{c_j \in C_t} sim(s_i, c_j)$$, where $$S_t$$ is the set of possible senses for term t and $$C_t$$ is the set of context senses.
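
The similarity measures and the sense selection rule above can be sketched on a toy example. The taxonomy, the information content (IC) values, and all function names below are hypothetical; a real system would look up senses and their ancestors in WordNet. Jiang and Conrath's measure is written here as the inverse of the distance IC(a) + IC(b) - 2 IC(lcs), which is a common convention and an assumption about the exact form used in the paper.

```python
from typing import Callable, List

# Hypothetical toy taxonomy (child -> parent) and made-up IC values.
PARENT = {
    "institution": "entity",
    "landform": "entity",
    "bank#finance": "institution",
    "credit_union": "institution",
    "lender": "institution",
    "bank#river": "landform",
}
IC = {
    "entity": 0.0, "institution": 2.0, "landform": 2.5,
    "bank#finance": 5.0, "bank#river": 6.0,
    "credit_union": 6.5, "lender": 5.5,
}


def ancestors(sense: str) -> List[str]:
    """Return the sense followed by all of its ancestors up to the root."""
    path = [sense]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lcs(a: str, b: str) -> str:
    """Lowest common subsumer: the first ancestor of a that also subsumes b."""
    b_ancestors = set(ancestors(b))
    return next(node for node in ancestors(a) if node in b_ancestors)


def resnik(a: str, b: str) -> float:
    """Resnik similarity: information content of the lowest common subsumer."""
    return IC[lcs(a, b)]


def jiang_conrath(a: str, b: str) -> float:
    """Inverse Jiang-Conrath distance, which also uses the IC of both senses."""
    distance = IC[a] + IC[b] - 2 * IC[lcs(a, b)]
    return 1.0 / distance if distance > 0 else float("inf")


def select_sense(candidates: List[str], context: List[str],
                 sim: Callable[[str, str], float]) -> str:
    """Pick the candidate sense with the largest summed similarity to the context."""
    return max(candidates, key=lambda s: sum(sim(s, c) for c in context))


candidates = ["bank#finance", "bank#river"]   # S_t
context = ["credit_union", "lender"]          # C_t
print(select_sense(candidates, context, resnik))
print(select_sense(candidates, context, jiang_conrath))
```

Passing the similarity function as an argument keeps the selection rule independent of the measure, so the two measures can be compared on the same candidates and context.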

Term Filtering

The authors select the most relevant terms by considering the following measures:

  1. Domain pertinence - how relevant a term is to a domain: $$DP_{D_i}(t) = \frac{freq(t/D_i)}{\max_j(freq(t/D_j))}$$, where $$freq(t/D_i)$$ is the frequency of term t in the corpus of domain $$D_i$$
  2. Lexical cohesion - how strongly the words of a multi-word term belong together (compare: significant phrase detection)
  3. Domain consensus - judges how important a term is based on the number of its occurrences
  4. Structural relevance - gives extra weight to terms that are emphasized or appear in titles.
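
As an illustration of the first measure, the following sketch computes domain pertinence for toy corpora. It assumes that frequency means relative frequency (count divided by corpus size) and that the maximum runs over all domains including the target; both choices, as well as the function names and the sample data, are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def relative_freq(term: str, tokens: List[str]) -> float:
    """Relative frequency of a term in one domain corpus (assumed definition)."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0


def domain_pertinence(term: str, target: str, corpora: Dict[str, List[str]]) -> float:
    """DP_{D_i}(t) = freq(t/D_i) / max_j freq(t/D_j), with D_i given by `target`."""
    freqs = {name: relative_freq(term, tokens) for name, tokens in corpora.items()}
    highest = max(freqs.values())
    return freqs[target] / highest if highest > 0 else 0.0


# Toy corpora, one token list per domain.
corpora = {
    "economics": ["inflation", "rate", "inflation", "market"],
    "sports": ["rate", "rate", "goal", "score"],
}
print(domain_pertinence("inflation", "economics", corpora))
print(domain_pertinence("rate", "economics", corpora))
```

A term that occurs only in the target domain gets the maximum score of 1, while a term that is more frequent in another domain is scored proportionally lower.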

Data Sources & Metadata Formats

  • RePub (online paper repository)
  • RePEc
  • Taxonomy: SKOS metadata format
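
To show what a taxonomy in SKOS might look like, the sketch below serializes (parent, child) pairs as Turtle using the standard `skos:Concept`, `skos:prefLabel`, `skos:broader` and `skos:narrower` terms. The `ex:` namespace, the function name, and the sample pair are hypothetical, and concept names are assumed to be valid Turtle local names (no spaces). How the authors actually generate their SKOS output is not described here.

```python
from typing import List, Tuple


def to_skos_turtle(pairs: List[Tuple[str, str]],
                   base: str = "http://example.org/taxonomy/") -> str:
    """Serialize (parent, child) pairs as SKOS concepts in Turtle syntax."""
    header = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix ex: <{base}> .",
    ]
    concepts = sorted({concept for pair in pairs for concept in pair})
    blocks = []
    for concept in concepts:
        props = ["a skos:Concept", f'skos:prefLabel "{concept}"@en']
        # Link each concept to its parents and children.
        props += [f"skos:broader ex:{p}" for p, c in pairs if c == concept]
        props += [f"skos:narrower ex:{c}" for p, c in pairs if p == concept]
        blocks.append(f"ex:{concept} " + " ;\n    ".join(props) + " .")
    return "\n".join(header) + "\n\n" + "\n\n".join(blocks) + "\n"


turtle = to_skos_turtle([("economics", "inflation")])
print(turtle)
```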