Data Science and Prediction

by Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.

This article provides insights into how data science complements science in providing instruments for predicting the future.

Why not statistics?

In contrast to statistics, which has been around for centuries, data science is concerned not only with structured but also with unstructured data. It focuses on discovering patterns in data to obtain actionable insights.

Karl Popper evaluated models and theories by their predictive power. This approach implicitly favors simple, succinct theories over more complex ones (Occam's razor), since simple theories are more likely to be robust on future data.

Machine learning automatically discovers patterns in data, but it is also prone to picking up noise. The critical question is whether these patterns are robust and hence likely to hold for future data. Standard methods for testing models and patterns use "out of sample" data (observations withheld from model fitting) and "out of time" data (observations from a later period) to assess their robustness.
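The two kinds of hold-out test can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the article: the synthetic data, the slowly drifting slope, the 20% test fraction, and all function names are assumptions made for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form). Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted line on the given points."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def out_of_sample_split(rows, test_fraction, rng):
    """Random hold-out: test rows are drawn from the same period as training rows."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def out_of_time_split(rows, test_fraction):
    """Temporal hold-out: every test row is later than every training row."""
    ordered = sorted(rows, key=lambda r: r[0])
    cut = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


def evaluate(train, test):
    """Fit on the training rows, report the error on the test rows."""
    model = fit_line([r[1] for r in train], [r[2] for r in train])
    return mse(model, [r[1] for r in test], [r[2] for r in test])


# Synthetic rows (t, x, y). The slope drifts with t (a hypothetical choice)
# so that the pattern learned on early data slowly stops holding.
rng = random.Random(0)
rows = []
for t in range(200):
    x = rng.uniform(0, 10)
    drifting_slope = 2.0 + 0.01 * t
    rows.append((t, x, drifting_slope * x + rng.gauss(0, 1)))

print("out-of-sample MSE:", evaluate(*out_of_sample_split(rows, 0.2, rng)))
print("out-of-time MSE:  ", evaluate(*out_of_time_split(rows, 0.2)))
```

When the underlying relationship changes over time, a random hold-out can look reassuring while the temporal hold-out exposes the deterioration, which is why both tests are usually run.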

Knowledge discovery

George Box's statement that "all models are wrong, but some are useful" and the observation by Dhar and Chou that "patterns emerge before reasons for them become apparent" raise the question of why scientists should bother developing detailed causal models in fields where such models yield only poor predictions and are likely to get worse over time due to concept drift.

This is especially true for the social sciences, where theories often employ causality without serious consideration of their predictive power. Hastie et al. (2009) identify the following three reasons for prediction errors:

  1. misspecification of the model (which becomes less of an issue with big data, since large amounts of data allow us to consider models with fewer assumptions),
  2. small samples and, therefore, greater biases (another problem addressed by big data), and
  3. randomness, even when the model is specified perfectly.

Dhar concludes that hypothesis-driven research has served us well, but that it does not scale to the volumes of data now being generated all around us.

Data scientists

Data scientists must combine multiple skills, including:

  1. statistics, such as Bayesian statistics or multivariate analysis,
  2. computer science, for knowledge of the internal representation of data and its manipulation by computers,
  3. knowledge about correlation and causation, and
  4. the ability to formulate problems and to recognize their underlying structure, which is often identical or isomorphic across problems, as demonstrated by Herbert Simon.

Societal changes that favor big data:

  1. cheap storage and growing data volumes: A 2011 McKinsey industry report concluded that the worldwide volume of data is growing at a rate of approximately 50% per year, a trend that is aided by the falling cost of storage. For instance, according to Dhar (2013), it is possible to store the world's entire stock of music on a $500 device.
  2. the move from intuition-based to fact-based decision making, which is best expressed by the following quote from W. Edwards Deming: "In God we trust, everyone else please bring data."
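To put the 50% annual growth rate from the first item into perspective, the compound-growth arithmetic can be worked out directly. The sketch below assumes the rate stays constant, which the cited report does not claim; the function names are illustrative.

```python
import math

ANNUAL_GROWTH = 0.50  # growth rate cited from the 2011 McKinsey report


def doubling_time_years(rate):
    """Years needed for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


def growth_factor(rate, years):
    """Total multiplication of the quantity after the given number of years."""
    return (1 + rate) ** years


print(round(doubling_time_years(ANNUAL_GROWTH), 2))  # about 1.71 years
print(round(growth_factor(ANNUAL_GROWTH, 10), 1))    # about 57.7x in ten years
```

At that rate the volume of data would double roughly every 21 months.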

Applications of big data

Computers already make the majority of investment decisions, decide which ads to display to individual customers, and handle tasks such as air traffic control and many types of planning. PayPal uses big data to predict the distribution of losses on a per-transaction basis, and IBM's Watson showcases technology that is able to predict the correct answers to "Jeopardy!" questions without a deep understanding of the questions asked.

Another application area is politics. For example, the Democratic National Committee invested heavily in predictive models built on the results of large-scale experiments used to manipulate attitudes. The campaign predicted, at the level of individual voters, how each person would most probably vote and how voters could be turned in favor of the party.

Outlook

Combining machine learning with high-quality, human-curated data greatly amplifies the potential of machine learning approaches, as demonstrated by IBM's Watson and Google's Knowledge Graph. This trend is expected to grow in importance over the years to come.