Word Sense Disambiguation for Automatic Taxonomy Construction from Text-Based Web Corpora

Albert Weichselbraun is a Professor of Information Science at the University of Applied Sciences of the Grisons.

by de Knijff et al.

This paper covers a framework that

extracts terms from Web corpora,
uses word sense disambiguation (WSD) to determine the intended sense of each term, and
applies subsumption to arrange the extracted concepts in a hierarchy.

Methods for Word Sense Disambiguation

Resnik's similarity: computes the similarity between terms based on the degree of information they share
Jiang and Conrath's similarity measure: more accurate; takes into account the information content of

the lowest common subsumer, and
the terms themselves
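The two measures can be sketched on a toy taxonomy. This is a minimal illustration, assuming the usual definitions: information content IC(c) = -log p(c), Resnik similarity as the IC of the lowest common subsumer, and the Jiang and Conrath measure in its distance form IC(a) + IC(b) - 2 IC(lcs(a, b)). The taxonomy, the counts, and all function names are hypothetical and not taken from the paper.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and concept counts; not from the paper.
PARENT = {"dog": "canine", "wolf": "canine", "canine": "animal",
          "cat": "animal", "animal": None}
COUNTS = {"dog": 20, "wolf": 20, "canine": 0, "cat": 60, "animal": 0}


def ancestors(concept):
    """Return the concept followed by its ancestors, nearest first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def subtree_count(concept):
    """Occurrences of the concept plus all of its descendants."""
    return sum(n for c, n in COUNTS.items() if concept in ancestors(c))


TOTAL = subtree_count("animal")


def ic(concept):
    """Information content: -log of the concept's relative frequency."""
    return -math.log(subtree_count(concept) / TOTAL)


def lcs(a, b):
    """Lowest common subsumer: the nearest ancestor shared by a and b."""
    shared = set(ancestors(b))
    return next(c for c in ancestors(a) if c in shared)


def resnik(a, b):
    """Resnik similarity: information content of the lowest common subsumer."""
    return ic(lcs(a, b))


def jiang_conrath_distance(a, b):
    """Jiang-Conrath distance: also uses the IC of the terms themselves."""
    return ic(a) + ic(b) - 2 * ic(lcs(a, b))


if __name__ == "__main__":
    for pair in [("dog", "wolf"), ("dog", "cat")]:
        print(pair, resnik(*pair), jiang_conrath_distance(*pair))
```

Note that the Jiang and Conrath value is a distance, so smaller means more similar; a similarity score is commonly obtained by taking its inverse.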

SSI (Structural Semantic Interconnections, a graph connectivity measure)

retrieve all possible senses from WordNet
select the sense as $$sense_t = \arg\max_{s_i \in S_t} \sum_{c_j \in C_t} sim(s_i, c_j)$$, where $$S_t$$ is the set of possible senses for term $$t$$ and $$C_t$$ is the set of context senses.
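The selection rule amounts to picking the candidate sense with the highest summed similarity to the context senses. The sketch below assumes a similarity function is already available; the sense labels, the lookup table `TOY_SIM`, and the helper names are hypothetical stand-ins rather than WordNet data or the authors' code.

```python
def select_sense(candidate_senses, context_senses, sim):
    """Return the candidate sense maximizing the summed similarity to the context."""
    return max(
        candidate_senses,
        key=lambda sense: sum(sim(sense, ctx) for ctx in context_senses),
    )


# Hypothetical similarity scores between sense labels (symmetric lookup).
TOY_SIM = {
    frozenset({"bank#finance", "money#currency"}): 0.8,
    frozenset({"bank#finance", "loan#credit"}): 0.9,
    frozenset({"bank#river", "money#currency"}): 0.1,
    frozenset({"bank#river", "loan#credit"}): 0.05,
}


def toy_sim(a, b):
    """Look up a similarity score; unknown pairs count as 0."""
    return TOY_SIM.get(frozenset({a, b}), 0.0)


if __name__ == "__main__":
    senses = ["bank#river", "bank#finance"]      # S_t: possible senses of the term
    context = ["money#currency", "loan#credit"]  # C_t: senses of the context
    print(select_sense(senses, context, toy_sim))
```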

Term Filtering

The authors select the most relevant terms by considering the following measures:

Domain pertinence - how relevant a term is for a domain \[ DP_{D_i}(t) = \frac{freq(t/D_i)}{\max_j(freq(t/D_j))} \] where $$freq(t/D_i)$$ is the frequency of term $$t$$ in the domain corpus $$D_i$$.

Lexical cohesion - cohesion among words (compare: significant phrase detection)
Domain consensus - judges how important a term is based on the number of its occurrences
Structural relevance - add extra points for emphasized and title terms.
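Of these measures, domain pertinence is the one given as an explicit formula, so it can be sketched directly: the term's frequency in the target domain divided by its highest frequency in any domain. The corpora below are made-up token lists, and the function name and the zero-frequency fallback are assumptions for illustration only.

```python
from collections import Counter


def domain_pertinence(term, domain, corpora):
    """Frequency of the term in `domain` relative to its peak frequency over all domains.

    `corpora` maps a domain name to a list of tokens. Returns 0.0 when the
    term occurs in no domain (an assumption; the formula is undefined there).
    """
    freqs = {name: Counter(tokens)[term] for name, tokens in corpora.items()}
    peak = max(freqs.values())
    return freqs[domain] / peak if peak else 0.0


if __name__ == "__main__":
    # Hypothetical tokenized domain corpora.
    corpora = {
        "finance": ["bank", "loan", "bank", "rate"],
        "geography": ["river", "bank", "delta"],
    }
    for name in corpora:
        print(name, domain_pertinence("bank", name, corpora))
```

A value of 1.0 means the term is most frequent in the given domain; values near 0 indicate that the term mainly belongs to another domain.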

Data Sources & Metadata Formats

RePub (online paper repository)
RePEc
Taxonomy: SKOS metadata format
