Big Data and Causality

Stefano Canali, University College of London

Last December, the UK government announced the foundation of the Alan Turing Institute for Data Science “to benefit British companies to have an advantage in big data”; according to the government, big data is one of the “eight great technologies” of the future.

Why is big data considered so valuable? In public discourse, its value is usually framed in terms of the correlations we can find in large data sets and the predictions we can make from them. For instance, 75% of Netflix’s income comes from recommendations based on correlations. On this view, correlations are enough and searching for causal relations beyond them is unnecessary: as Mayer-Schonberger and Cukier argue, causality “is being knocked off its pedestal as the primary fountain of meaning”.

However, dismissing causality may be premature. I think we actually need causality to extract the value of this “great technology”, and causal knowledge is particularly useful when it comes to policy-making.

Why would causality be so important for policy-making? First of all, in very large data sets there is always the possibility of finding spurious correlations and thus making wrong predictions; causal knowledge can help rule these out. Moreover, policy-making is often about intervening and, when we want to intervene, correlations may not be enough. In a data set, a variable is correlated not only with its causes, but also with its effects and with the other effects of its causes. When we want to intervene on a phenomenon, it is therefore crucial to know which variables are its causes and which are its effects, because intervening on the effects will not change the phenomenon, while intervening on the causes might. For instance, a health issue such as lung cancer may be something we want to change with a policy intervention. Here correlations are not enough: in the data, lung cancer is correlated both with its cause (smoking) and with the other effects of its cause (e.g. smoky clothes). Intervening on those effects (forbidding smoky clothes at school) will not change the issue, while intervening on the cause (forbidding smoking at school) may.
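The lung cancer example can be made concrete with a toy simulation. The sketch below is only an illustration under invented assumptions: the probabilities (30% smokers, 90% of smokers with smoky clothes, cancer risk of 15% for smokers and 1% for non-smokers) are made up and are not epidemiological estimates, and the function names are hypothetical. It shows that smoky clothes are correlated with cancer in the observed data, that banning smoky clothes leaves the cancer rate unchanged, and that banning smoking lowers it.

```python
import random

def simulate(n, ban_smoking=False, ban_smoky_clothes=False, seed=0):
    """Toy causal model: smoking -> cancer and smoking -> smoky clothes.

    All probabilities are invented for illustration only.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Draw all random numbers up front so that every scenario
        # run with the same seed describes the same individuals.
        u_smoke, u_clothes, u_cancer = rng.random(), rng.random(), rng.random()
        smokes = u_smoke < 0.3 and not ban_smoking
        smoky_clothes = smokes and u_clothes < 0.9 and not ban_smoky_clothes
        # Cancer depends on smoking only, never on clothes.
        cancer = u_cancer < (0.15 if smokes else 0.01)
        rows.append((smokes, smoky_clothes, cancer))
    return rows

def cancer_rate(rows):
    """Share of individuals with cancer."""
    return sum(cancer for _, _, cancer in rows) / len(rows)

def cancer_rate_given_clothes(rows, smoky):
    """Share of individuals with cancer among those with/without smoky clothes."""
    selected = [cancer for _, clothes, cancer in rows if clothes == smoky]
    return sum(selected) / len(selected)

N = 100_000
observed = simulate(N)
clothes_ban = simulate(N, ban_smoky_clothes=True)
smoking_ban = simulate(N, ban_smoking=True)

# Correlation in the observed data: smoky clothes "predict" cancer.
print("cancer rate, smoky clothes:   ", cancer_rate_given_clothes(observed, True))
print("cancer rate, no smoky clothes:", cancer_rate_given_clothes(observed, False))

# Interventions: only acting on the cause changes the outcome.
print("cancer rate, no policy:        ", cancer_rate(observed))
print("cancer rate, smoky clothes ban:", cancer_rate(clothes_ban))
print("cancer rate, smoking ban:      ", cancer_rate(smoking_ban))
```

A purely predictive model trained on the observed data could reasonably use smoky clothes as a signal for cancer risk. The prediction would be fine, but a policy built on that correlation would fail, which is exactly the gap between correlation and causal knowledge discussed above.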

Hence, causal knowledge may be useful to exploit the value of big data. But can we really extract causal knowledge from big data? We may find insights by looking at scientific research that relies on big data. For example, EXPOsOMICS is a new scientific project where scientists use big data to “predict individual disease risk related to the environment”; it is considered frontier research and is funded by the European Union. What is interesting about EXPOsOMICS is that researchers try to make predictions based on big data, but they do not rely on correlations alone: beyond correlations, they search for causal relations between the environment (e.g. pollution) and disease (e.g. asthma). By looking at the activities scientists carry out in order to find causality in big data sets, we might gain relevant insights into how to extract causal knowledge from big data. In particular, I would suggest that we look at the curation practices scientists adopt to make data useful for their research. Investigating the ways in which scientists curate data and produce causal knowledge is relevant, as we may apply them to other big data projects where causality can play an important role.

Stefano Canali
University College of London
