# Sparse Machine Learning Methods for Understanding Large Text Corpora

by Ghaoui et al.

Sparse machine learning methods provide models that are easier to interpret by seeking a trade-off between goodness-of-fit and the sparsity of the results. The authors present a number of sparse machine learning methods and apply them to multi-document topic summarization.

## Dataset

All experiments are performed on the Aviation Safety Reporting System (ASRS) dataset (http://asrs.arc.nasa.gov), which has the following crucial properties:

- large scale (>100,000 documents) and growing rapidly (approx. 6150 reports in 2011)
- noisy data (abbreviations, orthographic and grammatical errors, shortcuts, ...)
- complex information need (it is not known in advance what to look for; no haystack/needle problem)

## Sparse Machine Learning Methods

### Classification and regression

LASSO is a variant of least-squares regression that encourages sparsity:

\[\min_{\beta} ||X^T \beta - y||^2_2 + \lambda ||\beta||_1\]

The $$l_1$$ norm penalty encourages the regression coefficient vector $$\beta$$ to be sparse, which yields results that are easier to interpret.
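A minimal sketch of this sparsity effect, using scikit-learn's `Lasso` (the paper does not prescribe a particular solver; the data and the penalty weight below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))    # 100 samples, 20 features
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]      # only 3 features actually matter
y = X @ beta_true + 0.1 * rng.standard_normal(100)

# alpha plays the role of lambda: a larger value yields a sparser beta
model = Lasso(alpha=0.1).fit(X, y)
n_nonzero = np.count_nonzero(model.coef_)
print(n_nonzero)  # far fewer than 20 coefficients survive
```

Varying `alpha` traces out the trade-off between goodness-of-fit and sparsity mentioned above.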

### Principal component analysis

Sparse principal component analysis (Sparse PCA) is a variant of PCA that identifies sparse directions of high variance.

### Sparse models versus thresholded models

- sparse models are built around the philosophy that sparsity should be part of the model's formulation, typically via an $$l_1$$ penalty
- extensive research on the least-squares case shows that thresholded models (which keep only the top coefficients of a dense fit) are often sub-optimal
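The contrast between the two approaches can be sketched as follows (a toy comparison, assuming scikit-learn; the data, `k`, and `alpha` are illustrative choices, not taken from the paper):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 30))
beta_true = np.zeros(30)
beta_true[:4] = [1.5, -1.0, 0.8, 0.5]
y = X @ beta_true + 0.2 * rng.standard_normal(80)

# Thresholded model: fit dense least squares, then keep the k largest coefficients
ols = LinearRegression().fit(X, y)
k = 4
keep = np.argsort(np.abs(ols.coef_))[-k:]
beta_thresh = np.zeros(30)
beta_thresh[keep] = ols.coef_[keep]

# Sparse model: sparsity is part of the objective via the l1 penalty
lasso = Lasso(alpha=0.05).fit(X, y)

print(np.count_nonzero(beta_thresh), np.count_nonzero(lasso.coef_))
```

The thresholded fit is sparse only after the fact; the LASSO fit accounts for the penalty during optimization, which is why the two can select different features.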

## Method

The authors use LASSO regression for topic summarization (providing a summary of a topic rather than of a single article). They create target and reference corpora and compute terms that are specific to the target corpus.

## Evaluation

- per-category lists of the most predictive features (terms), compared against co-occurrence analysis
- visualizations: sparse PCA plots (each direction contains a small number of terms; the categories are arranged along these artificial axes, and the diameter of the point representing a category corresponds to the number of documents found in it)
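A sketch of how such plot coordinates could be obtained with scikit-learn's `SparsePCA` (the term-count matrix below is made up; the paper's actual vocabulary and preprocessing are not reproduced here):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(2)
X = rng.poisson(1.0, size=(50, 12)).astype(float)  # 50 documents, 12 terms
X -= X.mean(axis=0)                                # center before PCA

spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
coords = spca.fit_transform(X)                     # document positions on the two axes

# Each sparse direction involves only a few terms, which is what makes
# the resulting plot axes interpretable
for comp in spca.components_:
    print(np.flatnonzero(comp))  # indices of the terms defining this axis
```

Category positions (and point diameters) would then be derived by aggregating the `coords` of the documents in each category.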

### Suggestion: computation of PCA plot axes

Try all possible axis combinations (brute force) and use the pair that maximizes the squared distance ($$d^2$$) between the categories.
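This suggestion could be sketched as follows, scoring each candidate axis pair by the total squared pairwise distance between the projected category centroids (the function name, the scoring rule, and the toy data are illustrative assumptions):

```python
from itertools import combinations
import numpy as np

def best_axis_pair(centroids, directions):
    """centroids: (n_categories, n_terms); directions: (n_dirs, n_terms).
    Returns the pair of direction indices maximizing the summed squared
    pairwise distance between the projected categories."""
    best, best_score = None, -np.inf
    for i, j in combinations(range(len(directions)), 2):
        axes = directions[[i, j]]                  # (2, n_terms)
        proj = centroids @ axes.T                  # category positions in the plane
        diffs = proj[:, None, :] - proj[None, :, :]
        score = np.sum(diffs ** 2)                 # total squared d^2
        if score > best_score:
            best, best_score = (i, j), score
    return best, best_score

# toy example: three category centroids over three terms; the candidate
# directions are simply the coordinate axes, so axes 0 and 1 (where the
# centroids differ) should win
centroids = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
directions = np.eye(3)
pair, score = best_axis_pair(centroids, directions)
print(pair, score)
```

Brute force is feasible here because only a handful of sparse directions need to be considered, so the number of pairs stays small.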