Sparse Machine Learning Methods for Understanding Large Text Corpora

by El Ghaoui et al.

Sparse machine learning methods provide models that are easier to interpret by seeking a trade-off between goodness-of-fit and the sparsity of the results. The authors present a number of sparse machine learning methods and apply them to multi-document topic summarization.


All experiments are performed on the Aviation Safety Reporting System (ASRS) dataset, which has the following crucial properties:

  1. large scale (>100,000 documents) and growing rapidly (approx. 6150 reports in 2011)
  2. noisy data (abbreviations, orthographic and grammatical errors, shortcuts, ...)
  3. complex information need (it is not known in advance what to look for; no haystack/needle problem)

Sparse Machine Learning Methods

Classification and regression

LASSO is a variant of least-squares regression that adds a sparsity-inducing penalty to the objective:

\[\min_{\beta} \; \|X^T \beta - y\|_2^2 + \lambda \|\beta\|_1\]

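The $$\ell_1$$-norm penalty encourages the coefficient vector $$\beta$$ to be sparse, which yields results that are easier to interpret. The parameter $$\lambda \geq 0$$ controls the trade-off: the larger it is, the more coefficients become exactly zero.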
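To see how the penalty produces exact zeros, here is a minimal sketch of cyclic coordinate descent for this objective. It is not the solver used by the authors; the names `soft_threshold` and `lasso_cd` are made up for this illustration, and `rows` is assumed to hold one document per row (i.e. it plays the role of $$X^T$$).

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(rows, y, lam, n_iter=200):
    """Minimize ||A b - y||_2^2 + lam * ||b||_1 by cyclic coordinate descent.

    rows: list of documents, each a list of feature values (A = X^T).
    y:    list of target values, one per document.
    """
    n, p = len(rows), len(rows[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - A @ beta because beta starts at zero
    col_sq = [sum(rows[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # feature never occurs; its coefficient stays 0
            # Correlation of column j with the residual that ignores beta[j].
            rho = sum(rows[i][j] * (residual[i] + rows[i][j] * beta[j])
                      for i in range(n))
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= rows[i][j] * delta
                beta[j] = new
    return beta


# Toy check with an identity design: each coefficient is soft-thresholded at lam / 2.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
beta = lasso_cd(identity, [3.0, 0.2, -2.0], lam=1.0)
print(beta)  # [2.5, 0.0, -1.5]
```

The small coefficient (0.2) is set exactly to zero, while the large ones are shrunk by $$\lambda / 2$$. With $$\lambda = 0$$ the same loop reduces to ordinary least squares.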

Principal component analysis

Sparse principal component analysis (sparse PCA) is a variant of PCA that identifies directions of high variance whose loading vectors are sparse, so that each direction involves only a few features (terms).
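As a rough illustration of the idea, the sketch below uses truncated power iteration, a simple heuristic that keeps only the `k` largest entries of the direction in every step. This is a stand-in chosen for brevity, not the algorithm used in the paper; the function name `sparse_pc` and the toy covariance matrix are assumptions.

```python
import math


def sparse_pc(cov, k, n_iter=100):
    """Leading sparse direction of a covariance matrix (truncated power iteration).

    cov: symmetric p x p matrix as a list of lists.
    k:   number of nonzero loadings to keep.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p  # start from the uniform unit vector
    for _ in range(n_iter):
        # Power step: multiply the current direction by the covariance matrix.
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        # Truncation step: keep the k entries of largest magnitude, zero the rest.
        keep = set(sorted(range(p), key=lambda i: abs(w[i]), reverse=True)[:k])
        w = [w[i] if i in keep else 0.0 for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


# Toy covariance: features 0 and 1 vary strongly together, 2 and 3 only weakly.
cov = [
    [4.0, 3.8, 0.0, 0.0],
    [3.8, 4.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
]
direction = sparse_pc(cov, k=2)
```

On this toy matrix the returned direction puts equal weight on features 0 and 1 and exactly zero weight on the other two, which is what makes such a direction readable as a short list of terms.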

Sparse models versus thresholded models

  1. sparse models are built around the philosophy that sparsity should be part of the model's formulation, typically via an $$\ell_1$$ penalty
  2. extensive research on the least-squares case shows that thresholded models (which fit a dense model and then keep only the top results) are often sub-optimal
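To make the contrast concrete, consider the special case of an orthonormal design, $$X X^T = I$$ in the notation of the LASSO objective (an assumption made here purely for illustration). Writing $$b = X y$$ for the ordinary least-squares coefficients and $$\tau$$ for a hypothetical cutoff, the two approaches give:

\[\hat{\beta}^{\text{lasso}}_j = \operatorname{sign}(b_j) \, \max\left(\lvert b_j \rvert - \tfrac{\lambda}{2}, \, 0\right), \qquad \hat{\beta}^{\text{thr}}_j = \begin{cases} b_j & \text{if } \lvert b_j \rvert > \tau \\ 0 & \text{otherwise} \end{cases}\]

In this special case both select the same terms when $$\tau = \lambda / 2$$. With correlated features, which is the typical situation for term counts, this equivalence no longer holds: the LASSO estimates all coefficients jointly under the penalty, whereas thresholding ranks coefficients that were computed as if every feature were kept, so the selected sets can differ.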


The authors use LASSO regression for topic summarization, i.e. summarizing a topic rather than a single article. They build a target corpus and a reference corpus and compute the terms that are specific to the target corpus.
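One plausible way to set this up as a regression problem is sketched below: documents become rows of term counts, target documents get the label +1 and reference documents the label -1. The labeling scheme, the whitespace tokenizer, the function name `build_design`, and the example reports are all assumptions for illustration and are not taken from the paper or the ASRS data.

```python
def build_design(target_docs, reference_docs):
    """Return (vocab, rows, y) for a target-versus-reference regression.

    rows: one list of term counts per document (target documents first).
    y:    +1.0 for target documents, -1.0 for reference documents.
    """
    docs = [d.lower().split() for d in target_docs + reference_docs]
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for term in doc:
            row[index[term]] += 1.0
        rows.append(row)
    y = [1.0] * len(target_docs) + [-1.0] * len(reference_docs)
    return vocab, rows, y


# Invented mini-corpus: the target topic is runway incursions.
vocab, rows, y = build_design(
    target_docs=["runway incursion"],
    reference_docs=["altitude deviation", "runway closed"],
)
```

Fitting a LASSO model to `rows` and `y` then yields a sparse coefficient vector; under this labeling, the terms with nonzero positive coefficients would form the summary of the target topic.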


  1. Per-category lists of the most predictive features (terms); compare with co-occurrence analysis
  2. Visualizations: sparse PCA plots (interesting). Each direction contains a small number of terms, the categories are arranged along these artificial axes, and the diameter of the point representing a category corresponds to the number of documents found in it.

Suggestion: Computation of PCA plot axes:

Try all possible combinations of candidate directions (brute force) and use the one that maximizes the squared distance ($$d^2$$) between the categories.
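A minimal sketch of this brute-force search follows. It assumes that "distance between the categories" means the sum of squared distances over all pairs of categories, which is only one possible reading; the function name `best_axes` and the toy coordinates are likewise assumptions.

```python
from itertools import combinations


def best_axes(scores, n_axes=2):
    """Pick the subset of directions that spreads the categories out the most.

    scores: dict mapping category name -> list of coordinates,
            one coordinate per candidate direction.
    Returns a tuple with the indices of the chosen directions.
    """
    names = list(scores)
    n_dirs = len(scores[names[0]])

    def spread(axes):
        # Sum of squared distances between all category pairs,
        # measured only along the chosen axes.
        return sum(
            (scores[a][k] - scores[b][k]) ** 2
            for a, b in combinations(names, 2)
            for k in axes
        )

    return max(combinations(range(n_dirs), n_axes), key=spread)


# Toy coordinates of three categories along three candidate directions.
scores = {"A": [0.0, 0.0, 0.0], "B": [1.0, 0.0, 5.0], "C": [2.0, 0.0, -5.0]}
print(best_axes(scores))  # (0, 2)
```

The number of subsets grows combinatorially with the number of candidate directions, so this is only practical when few directions are computed, which is usually the case for a 2D plot.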