Sparse Machine Learning Methods for Understanding Large Text Corpora
by El Ghaoui et al.
Sparse machine learning methods provide models that are easier to interpret by seeking a trade-off between goodness-of-fit and the sparsity of the results. The authors present a number of sparse machine learning methods and apply them to multi-document topic summarization.
Dataset
All experiments are performed on the Aviation Safety Reporting System (ASRS) dataset (http://asrs.arc.nasa.gov), which has the following crucial properties:
- large scale (>100,000 documents) and growing rapidly (approx. 6150 reports in 2011)
- noisy data (abbreviations, orthographic and grammatical errors, shortcuts, ...)
- complex information need (it is not known in advance what to look for; no haystack/needle problem)
Sparse Machine Learning Methods
Classification and regression
LASSO is a variant of least squares regression that encourages sparsity:
\[\min_{\beta} \|X^T \beta - y\|_2^2 + \lambda \|\beta\|_1\]
Here $$X$$ is the data matrix, $$y$$ is the response vector, and $$\lambda \ge 0$$ controls the trade-off between fit and sparsity. The $$l_1$$ norm penalty encourages the regression coefficient vector $$\beta$$ to be sparse, which yields results that are easier to interpret.
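To illustrate how the $$l_1$$ penalty produces coefficients that are exactly zero, here is a minimal sketch of coordinate descent for the objective above, written in plain Python. It is not the solver used by the authors; the function names, the toy data, and the choice of $$\lambda = 1$$ are assumptions made for this example. Each entry of `rows` is one row of $$X^T$$.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_coordinate_descent(rows, y, lam, sweeps=100):
    """Minimise ||A beta - y||_2^2 + lam * ||beta||_1, where A = X^T.

    rows: list of observations, each a list of feature values (a row of A).
    """
    n_features = len(rows[0])
    beta = [0.0] * n_features
    for _ in range(sweeps):
        for j in range(n_features):
            rho = 0.0  # inner product of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for row, target in zip(rows, y):
                prediction_without_j = sum(
                    row[k] * beta[k] for k in range(n_features) if k != j
                )
                rho += row[j] * (target - prediction_without_j)
                z += row[j] ** 2
            # Closed-form minimiser of z*b^2 - 2*rho*b + lam*|b| over b.
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return beta


# Toy data (invented): the response is exactly twice the first feature.
rows = [[1, 0, 1], [2, 1, 0], [3, 0, 1], [4, 1, 0], [5, 1, 1], [6, 0, 0]]
y = [2, 4, 6, 8, 10, 12]
beta = lasso_coordinate_descent(rows, y, lam=1.0)
print(beta)
```

In this toy problem the response depends only on the first feature, so the other two coefficients are set exactly to zero, while the first is shrunk slightly below 2.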
Principal component analysis
Sparse principal component analysis (Sparse PCA) is a variant of PCA that identifies sparse directions of high variance.
Sparse models versus thresholded models
- sparse models are built around the philosophy that sparsity should be part of the model's formulation, typically through an $$l_1$$ penalty
- extensive research on the least squares case shows that thresholded models (which keep only the top results of a dense model) are often sub-optimal
Method
The authors use LASSO regression for topic summarization (summarizing a topic rather than a single article). They create target and reference corpora and compute the terms that are specific to the target corpus.
Evaluation
- Per-category lists of the most predictive features (terms), compared with co-occurrence analysis
- Visualizations: sparse PCA plots (interesting). Each direction contains a number of terms, and the categories are arranged along these artificial axes; the diameter of the point representing a category corresponds to the number of documents found in it.
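To make the idea of a sparse direction concrete, here is a minimal sketch of a truncated power iteration, a simple heuristic for sparse PCA that keeps only the `k` largest-magnitude entries of the direction after each step. It is a stand-in for illustration, not the sparse PCA algorithm used in the paper; the covariance matrix and the term names are invented.

```python
import math


def truncated_power_iteration(cov, k, iterations=50):
    """Approximate a leading direction of `cov` with at most k non-zero entries."""
    n = len(cov)
    v = [1.0 / math.sqrt(n)] * n  # dense starting direction
    for _ in range(iterations):
        # Power step: multiply the current direction by the covariance matrix.
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Truncation step: keep the k entries with the largest magnitude.
        keep = set(sorted(range(n), key=lambda i: abs(w[i]), reverse=True)[:k])
        w = [w[i] if i in keep else 0.0 for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical covariance between four term counts.
terms = ["runway", "taxiway", "weather", "fuel"]
cov = [
    [4.0, 3.0, 0.1, 0.0],
    [3.0, 4.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
]
direction = truncated_power_iteration(cov, k=2)
selected_terms = [t for t, weight in zip(terms, direction) if weight != 0.0]
print(selected_terms)
```

The resulting direction has non-zero weight on only two terms, which is what lets each plot axis be labelled with a short list of words.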
Suggestion for computing the PCA plot axes:
Try all possible combinations (brute force) and use the one that maximizes the squared distance ($$d^2$$) between the categories.
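The suggested brute-force search could be sketched as follows. This is one possible reading of the suggestion: "distance between the categories" is taken to mean the sum of squared Euclidean distances over all pairs of category positions, and the input format, function names, and scores are hypothetical.

```python
from itertools import combinations


def total_squared_distance(points):
    """Sum of squared Euclidean distances over all pairs of points."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in combinations(points, 2)
    )


def best_axis_pair(category_scores):
    """Try every pair of candidate directions and keep the most spread-out one.

    category_scores maps a category name to its coordinates along each
    candidate direction (hypothetical input format).
    """
    n_directions = len(next(iter(category_scores.values())))
    best_pair, best_value = None, float("-inf")
    for i, j in combinations(range(n_directions), 2):
        points = [(s[i], s[j]) for s in category_scores.values()]
        value = total_squared_distance(points)
        if value > best_value:
            best_pair, best_value = (i, j), value
    return best_pair, best_value


# Invented positions of three categories along three candidate directions.
category_scores = {
    "category A": [0, 1, 5],
    "category B": [0, 2, 0],
    "category C": [1, 0, 1],
}
best_pair, best_value = best_axis_pair(category_scores)
print(best_pair, best_value)
```

For these invented scores the second and third directions (indices 1 and 2) separate the categories most, with a total squared distance of 48.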