Sparse Machine Learning Methods for Understanding Large Text Corpora

by El Ghaoui et al.

Sparse machine learning methods provide models that are easier to interpret by seeking a trade-off between goodness-of-fit and the sparsity of the results. The authors present a number of sparse machine learning methods and apply them to multi-document topic summarization.

Dataset

All experiments are performed on the Aviation Safety Reporting System (ASRS) dataset (http://asrs.arc.nasa.gov), which has the following key properties:

large scale (>100,000 documents) and growing rapidly (approx. 6150 reports in 2011)
noisy data (abbreviations, orthographic and grammatical errors, shortcuts, ...)
complex information need (it is not known in advance what to look for; this is not a needle-in-a-haystack problem)

Sparse Machine Learning Methods

Classification and regression

LASSO is a variant of least squares regression that adds a sparsity-inducing penalty to the objective:

\[\min_{\beta} \|X^T \beta - y\|_2^2 + \lambda \|\beta\|_1\]

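The $$\ell_1$$-norm penalty encourages the coefficient vector $$\beta$$ to be sparse, which yields results that are easier to interpret. The parameter $$\lambda \geq 0$$ controls the trade-off: larger values drive more coefficients to exactly zero.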
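To make the effect of the penalty concrete, here is a minimal pure-Python sketch that minimizes the objective above by cyclic coordinate descent with soft-thresholding. This is one standard way to solve LASSO problems, not necessarily the solver the authors used. The function names, the matrix `A` (standing in for $$X^T$$, one row per document) and the toy numbers are assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(A, y, lam, n_iter=200):
    """Minimize ||A b - y||_2^2 + lam * ||b||_1 by cyclic coordinate descent.

    A is a list of rows (one row per document, one column per feature),
    i.e. it plays the role of X^T in the formula above.
    """
    n, p = len(A), len(A[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - A @ beta, since beta starts at zero
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot explain anything
            # Inner product of column j with the residual that ignores feature j.
            rho = sum(A[i][j] * (residual[i] + A[i][j] * beta[j]) for i in range(n))
            new_bj = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= A[i][j] * delta
                beta[j] = new_bj
    return beta


# Toy data (made up): y depends only on the first feature.
A = [[1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0],
     [3.0, 0.0, 1.0],
     [4.0, 1.0, 0.0]]
y = [2.0, 4.0, 6.0, 8.0]
beta = lasso_coordinate_descent(A, y, lam=1.0)
print(beta)
```

On this toy problem the coefficients of the two irrelevant features come out as exactly zero, while the first coefficient is shrunk slightly below 2. That shrinkage is the price paid for sparsity.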

Principal component analysis

Sparse principal component analysis (sparse PCA) is a variant of PCA that identifies sparse directions of high variance, i.e. directions that involve only a few of the original features (here: terms).
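One common way to write this down is as variance maximization with a cardinality penalty. The summary does not state which exact formulation the authors solve, so the following should be read as an illustrative assumption. Here $$\Sigma$$ denotes the covariance matrix of the data, $$z$$ the direction sought, $$\|z\|_0$$ the number of non-zero entries of $$z$$, and $$\lambda \geq 0$$ the sparsity trade-off:

\[\max_{z} \; z^T \Sigma z - \lambda \|z\|_0 \quad \text{subject to} \quad \|z\|_2 = 1\]

For $$\lambda = 0$$ this reduces to ordinary PCA, whose solution is the leading eigenvector of $$\Sigma$$.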

Sparse models versus thresholded models

sparse models are built around the philosophy that sparsity should be part of the model's formulation, typically through an $$\ell_1$$ penalty
extensive research on the least squares case shows that thresholded models (fit a dense model, then keep only the top results) are often sub-optimal

Method

The authors use LASSO regression for topic summarization (i.e., summarizing a topic rather than a single article). They create a target corpus and a reference corpus and compute the terms that are specific to the target corpus.
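The target-versus-reference setup can be sketched as a labelled bag-of-words problem. The snippet below is a minimal illustration under stated assumptions: whitespace tokenization, raw term counts, labels of +1 (target) and -1 (reference), and made-up example sentences rather than ASRS data. The helper name `build_design` is hypothetical.

```python
from collections import Counter


def build_design(target_docs, reference_docs):
    """Return (vocabulary, rows, labels) for a target-vs-reference problem.

    rows[i][j] is the count of vocabulary[j] in document i;
    labels[i] is +1.0 for target documents and -1.0 for reference documents.
    """
    docs = [(d, 1.0) for d in target_docs] + [(d, -1.0) for d in reference_docs]
    # Naive tokenization: lowercase and split on whitespace.
    counts = [Counter(text.lower().split()) for text, _ in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[float(c[term]) for term in vocabulary] for c in counts]
    labels = [label for _, label in docs]
    return vocabulary, rows, labels


# Made-up documents for illustration only.
target = ["runway incursion at night", "runway closed taxi error"]
reference = ["engine vibration at cruise", "cabin pressure warning at night"]

vocabulary, rows, labels = build_design(target, reference)
print(vocabulary)
print(labels)
```

Fitting any LASSO solver on `rows` and `labels` then yields a sparse coefficient vector; the terms with non-zero coefficients are the candidates for the topic summary.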

Evaluation

Per-category lists of the most predictive features (terms), compared with co-occurrence analysis
Visualizations: sparse PCA plots (interesting: each direction contains a number of terms; the categories are arranged along these artificial axes, and the diameter of the point representing a category corresponds to the number of documents found in it).

Suggestion: Computation of PCA plot axes:

Try all possible combinations of axes (brute force) and use the one that maximizes the squared distance ($$d^2$$) between the categories.
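A minimal sketch of this brute-force search follows. It assumes that each category already has a coordinate along every candidate axis and that "distance between the categories" means the sum of squared Euclidean distances over all category pairs in the chosen two-dimensional plane. Both are interpretations, and the function name and toy scores are hypothetical.

```python
from itertools import combinations


def best_axis_pair(category_scores):
    """Pick the two axes that spread the categories out the most.

    category_scores maps a category name to its coordinates along each
    candidate axis. The score of an axis pair is the sum of squared
    Euclidean distances (d^2) between all pairs of categories in that plane.
    """
    points = list(category_scores.values())
    n_axes = len(points[0])
    best_pair, best_score = None, float("-inf")
    for a, b in combinations(range(n_axes), 2):
        score = sum(
            (p[a] - q[a]) ** 2 + (p[b] - q[b]) ** 2
            for p, q in combinations(points, 2)
        )
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair, best_score


# Made-up coordinates of three categories along three candidate axes.
scores = {
    "A": [0.0, 1.0, 0.1],
    "B": [0.1, -1.0, 2.0],
    "C": [0.0, 0.5, -2.0],
}
pair, score = best_axis_pair(scores)
print(pair, score)
```

With this particular objective the score is additive over axes, so the exhaustive search could be replaced by ranking the axes individually; brute force only becomes necessary for objectives that do not decompose this way (for example, maximizing the minimum pairwise distance).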

Albert Weichselbraun
