Big Data is good for answers, not for questions

Big data is good at predicting outcomes, but does a poorer job of explaining the causes that drive them. It can provide answers only to the right types of questions.

There has been a lot of buzz lately about big data and its ability to make relevant associations, reveal trends, and unlock hidden value. At the same time, it has been roundly criticized for being able to deliver facts but not explanations: it can tell you that something exists, but not why.

But does knowing why really matter?

Intuitively, it would seem so. If we understand why something happens, then we can either continue to do it if the results are good, or stop doing it if the consequences are bad. In many situations, however, understanding why matters less than we might think.

To explain, let’s use a golfing analogy. If, like me, you tend to slice the ball, you might decide to visit a golf pro to take lessons. The golf pro will observe how you hit the ball and try to fix your swing. In doing so, he is unlikely to say, "your slice comes from a fear of commitment that resulted from the insecurity you felt as a child"; nor will he tell you, "your slice is due to a variance in deltoid muscle mass between your dominant and non-dominant sides", despite the fact that both of these reasons may be true.

In fact, the cause of the slice is irrelevant. What matters is how to correct the problematic behaviour. You are more likely to hear him tell you something like this: "When you swing, rotate your hips and place 80 per cent of your weight on your left leg. Now, go and practise that 100 times." When it comes to a golf swing, what matters is changing behaviours, not diagnosing causes.

Let’s look at another example: predicting the future prices of Bordeaux wines. According to a study by Princeton economist Orley Ashenfelter, these prices can be calculated using only three factors. The resulting predictions are much more accurate than, say, the opinions of Bordeaux wine experts or even the prices of those same wines when they are young. Not surprisingly, this does not make Ashenfelter very popular among oenophiles. They are quick to discount and deride his work, pointing out that his approach is like reviewing a movie without taking the time to watch it.

At this point, whether or not you are interested in Bordeaux wines, you are probably curious about the three factors that Ashenfelter identified. This curiosity is an embodiment of our natural desire to understand causes — to find out why. In reality, however, this information is unimportant; it only matters that the formula works. From a business perspective, it is only relevant to know that there are three factors (not two or four) and that they can be accurately collected and measured.

This is exactly the information that big data can provide.

An example of the power of big data to predict outcomes without the need to explain them can be found in movie box office results. It has long been known that the performance of a movie on its opening weekend is a very strong predictor of its performance across its total run. However, data from Google searches suggests that a movie’s performance can be accurately predicted even before it hits the screen. Google analysed the performance of 99 films released in 2012 and found that 70 per cent of the variation in box office performance could be explained by search query volumes. In fact, query volumes and other movie-related variables, like trailer views, were able to predict a film’s opening weekend performance with 92 per cent accuracy.
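To make "70 per cent of the variation explained" concrete, here is a minimal sketch of how that kind of figure is typically computed for a single predictor: the coefficient of determination (R squared) of a straight-line fit. The function name and the sample numbers are hypothetical and purely illustrative; they are not Google's data or method.

```python
def r_squared(x, y):
    """Share of the variation in y explained by a straight-line fit on x.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical, made-up numbers for illustration only.
search_volume = [10, 20, 30, 40, 50]   # e.g. pre-release query counts
box_office = [12, 18, 35, 33, 52]      # e.g. opening-weekend takings

print(f"Share of variation explained: {r_squared(search_volume, box_office):.2f}")
```

A value of 0.70 would mean that 70 per cent of the film-to-film differences in box office takings line up with differences in search volume. It says nothing about why the two move together.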

Our natural desire to ‘get to the bottom of something’ or conduct ‘root-cause analysis’ is, in many cases, counterproductive. First, time and resources are wasted tracking down reasons and justifications. Second, even if we find a cause, the condition being studied may have changed, so that the cause is no longer relevant. In fast-moving sectors, this can be a real concern. Third, we may find the wrong cause and thus build incorrect assumptions that misdirect future behaviour. Our common reliance on allocating blame to stereotypes is an example of this. Fourth, we may project a reason for something when none actually exists. The cause may just be random variation, or noise without a signal, and yet we create a signal all the same. In sum, our search for why can lead us down unproductive and erroneous paths.

I believe the largest big data-related challenge today has nothing to do with the issues we hear about most often, such as inaccurate data, poorly integrated systems, privacy concerns, and the like. These issues may be relevant, but they pale in comparison to the inability of management to trust the data and modify behaviour in light of what the data shows. Leaders everywhere struggle to accept that when it comes to big data, it is frequently more productive to trust the formula and forget the reason why.

Let’s look at a final example. When sifting through your company data, analytics might tell you that employees who are members of several social networks are more likely to quit than those who are members of two or fewer (recent research suggests that this is in fact true). You may be curious to know why this is the case.

Indeed, many possible reasons may come to mind: people who are active on multiple social networks might be more naturally gregarious, might be less loyal, or might have access to more job opportunities through larger networks. You may be able to confirm one of these hypotheses with some investigation. But, in the end, it doesn’t matter. What matters is how you use the information, and this does not depend at all on the cause. For example, you could decide to make information about social network membership status a criterion for hiring. This information is easily available. Or you may factor it into job rotation or promotion decisions.

The key is to find the right formula and then trust what the data tells you, even when it contradicts your intuition or screams out for further analysis. Smart companies adjust their behaviour by acting on the best information available, understanding that time is of the essence. Lesser companies wait until they understand the why before they act. In today’s fast-paced world, this is a luxury that most of us can ill afford.

Addendum: In order to satisfy your natural (but superfluous) need to know why, the formula that Ashenfelter came up with to predict the future price of Bordeaux wine was as follows: 12.145 plus 0.00117 times the amount of winter rainfall (reason: winter rain tends to increase the yield without reducing the quality) plus 0.0614 times the average growing season temperature (reason: higher temperatures tend to lead to better wines) minus 0.00386 times the amount of harvest rainfall (reason: rainfall around harvest time tends to lead to rot). The result of this calculation has been remarkably robust and accurate over more than 50 years.
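For readers who would rather see the formula than read it, the calculation above can be sketched as a small function. The coefficients are the ones quoted in the addendum; the function and parameter names are hypothetical, and the units (millimetres for rainfall, degrees Celsius for temperature) are an assumption, since the text does not state them or the scale of the result.

```python
def bordeaux_score(winter_rain, growing_season_temp, harvest_rain):
    """Evaluate the three-factor formula quoted in the addendum.

    Units are assumed (rainfall in mm, temperature in degrees Celsius);
    the article does not specify them or the scale of the result.
    """
    return (
        12.145
        + 0.00117 * winter_rain          # winter rain raises the result
        + 0.0614 * growing_season_temp   # warmer growing seasons raise it
        - 0.00386 * harvest_rain         # rain around harvest lowers it
    )


# Hypothetical inputs for illustration only, not a real vintage.
example = bordeaux_score(winter_rain=600, growing_season_temp=17, harvest_rain=100)
print(f"{example:.4f}")
```

Note how the function needs only three measurable inputs and no knowledge of the parenthetical reasons, which is the point of the article.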

Michael Wade is professor of innovation and strategic information management at IMD. He is Program Director of Orchestrating Winning Performance and teaches in IMD's Breakthrough Program for Senior Executives.