Data Science and Prediction
by Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.
This article provides insights into how data science complements science in providing instruments for predicting the future.
Why not statistics?
Unlike statistics, which has been around for centuries, data science is concerned with unstructured as well as structured data. It focuses on discovering patterns in data to obtain actionable insights. Karl Popper evaluated models and theories by their predictive power, an approach that implicitly favors simple, succinct theories over more complex ones (Occam's razor), since simple theories are more likely to be robust on future data.
Machine learning automatically identifies patterns in data, but it is also prone to picking up noise. The critical question is whether these patterns are robust and hence likely to hold for future data. Standard methods for testing models and patterns use "out of sample" and "out of time" data to assess their robustness.
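The difference between the two kinds of held-out data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the function names `out_of_sample_split` and `out_of_time_split`, and the cutoff date are hypothetical and do not come from the article.

```python
import random
from datetime import date, timedelta


def out_of_sample_split(records, test_fraction=0.25, seed=0):
    """Randomly hold out a fraction of the records, ignoring time order."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def out_of_time_split(records, cutoff):
    """Train on records dated before the cutoff, test on the rest."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


# Hypothetical toy data: one record per day for 100 days.
start = date(2013, 1, 1)
records = [{"date": start + timedelta(days=i), "value": i} for i in range(100)]

train_s, test_s = out_of_sample_split(records)
train_t, test_t = out_of_time_split(records, cutoff=date(2013, 3, 1))

print(len(train_s), len(test_s))  # 75 25
print(len(train_t), len(test_t))  # 59 41
```

An out-of-sample test checks whether a pattern generalizes to unseen records from the same period. An out-of-time test additionally checks whether the pattern still holds in a later period, which is the stricter test when the underlying process may change.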
Knowledge discovery
George Box's statement that "all models are wrong, but some are useful" and the observation by Dhar and Chou that "patterns emerge before reasons for them become apparent" raise the question of why scientists should bother developing detailed causal models for fields where such models yield only poor predictions and are likely to get worse over time due to concept drift. The question is especially pertinent for the social sciences, where theories often employ causality without serious consideration of their predictive power. Hastie et al. (2009) identify the following three reasons for prediction errors:
- misspecification of the model (which becomes less of an issue with big data, since large amounts of data allow us to consider models with fewer assumptions),
- small samples and, therefore, greater biases (another problem addressed by big data), and
- randomness, even when the model is specified perfectly.
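The three sources of error listed above can be made concrete with a small simulation. This is a minimal sketch, not something taken from Dhar or Hastie et al.: the true process `y = 3x + noise`, the sample sizes, and all function names are assumptions chosen for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def sample(rng, n, noise_sd=1.0):
    """Draw n points from the assumed true process y = 3x + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3 * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def mse(predict, xs, ys):
    """Mean squared prediction error on held-out data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(42)
test_x, test_y = sample(rng, 2000)
train_x, train_y = sample(rng, 2000)

# 1. Misspecified model: ignores x and always predicts the training mean.
mean_y = sum(train_y) / len(train_y)
mse_misspecified = mse(lambda x: mean_y, test_x, test_y)

# 2. Correct model form, small samples: average over many 5-point fits.
small_runs = []
for _ in range(200):
    xs, ys = sample(rng, 5)
    a5, b5 = fit_line(xs, ys)
    small_runs.append(mse(lambda x, a=a5, b=b5: a + b * x, test_x, test_y))
mse_small_sample = sum(small_runs) / len(small_runs)

# 3. Correct model form, large sample: what remains is mostly the noise.
a, b = fit_line(train_x, train_y)
mse_large_sample = mse(lambda x: a + b * x, test_x, test_y)

print(f"misspecified: {mse_misspecified:.2f}")
print(f"small sample: {mse_small_sample:.2f}")
print(f"large sample: {mse_large_sample:.2f}")
```

In this setup the misspecified model has by far the largest error, the small-sample fits sit in between, and the large-sample fit approaches the noise variance of 1.0, which no amount of data removes.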
Data scientists
Data scientists must combine multiple skills, including:
- statistics, such as Bayesian statistics or multivariate analysis,
- computer science for knowledge on the internal representation of data and its manipulation by computers,
- knowledge about correlation and causation, and
- the ability to formulate problems and to recognize their underlying, often identical or isomorphic, structure, as demonstrated by Herbert Simon.
Societal changes that favor big data:
- cheap storage and growing data volumes: A 2011 McKinsey industry report concluded that the worldwide volume of data is growing at a rate of approximately 50% per year, a trend that is aided by the falling cost of storage. For instance, according to Dhar (2013), it is possible to store the world's entire stock of music on a $500 device.
- the move from intuition-based to fact-based decision making, which is best expressed by the following quote from W. Edwards Deming: "In God we trust, everyone else please bring data."
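To get a feel for what the growth rate of approximately 50% per year quoted above implies, the compounding can be worked out directly. The sketch assumes the rate stays constant from year to year, which the summary does not claim.

```python
import math

annual_growth = 0.50  # approximately 50% per year, as quoted above

# Years needed for the volume to double under constant compound growth.
doubling_time_years = math.log(2) / math.log(1 + annual_growth)

# Growth factor after five years of compounding.
five_year_factor = (1 + annual_growth) ** 5

print(f"doubling time: {doubling_time_years:.2f} years")  # 1.71 years
print(f"five-year factor: {five_year_factor:.2f}x")       # 7.59x
```

Under that assumption the data volume doubles roughly every 1.7 years and grows more than sevenfold in five years.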
Applications of big data
Computers already make the majority of investment decisions, decide which ads to display to individual customers, and handle tasks such as air traffic control and many types of planning. PayPal uses big data to predict the distribution of losses on a per-transaction basis, and IBM's Watson is a showcase of technology that is able to predict the correct answers to "Jeopardy!" questions without a deep understanding of the questions asked. Another application area is politics. For example, the Democratic National Committee invested heavily in predictive models built on the results of large-scale experiments used to manipulate attitudes. The campaign predicted, at the level of individual voters, how they would most probably vote and how to turn them in favor of the party.