### Association vs. causal relationships

Association. As noted in the previous page, researchers are usually interested in relationships between variables. When two variables are related, we say there is an association between them. A relationship, or correlation, broadly refers to any association between two or more variables, whereas a causal relationship is one in which changes in one variable directly produce changes in the other. This post looks at the difference between causation and correlation.

A scientist in a dairy factory tries four different packaging materials for blocks of cheese and measures their shelf life. The packaging material might influence shelf life, but the shelf life cannot influence the packaging material used. The relationship is therefore causal. A bank manager is concerned with the number of customers whose accounts are overdrawn.


Half of the accounts that become overdrawn in one week are randomly selected, and the manager telephones those customers to offer advice. Any difference after two months between the mean balances of the overdrawn accounts that did and did not receive advice can be causally attributed to the phone calls. If two variables are causally related, we can conclude that changes to the explanatory variable, X, will have a direct impact on Y.

**Non-causal relationships.** Not all relationships are causal. In a non-causal relationship, the association between the two variables is not entirely the result of one variable directly affecting the other. In the most extreme case, two variables can be related without either one directly affecting the values of the other.


The two diagrams below illustrate mechanisms that result in non-causal relationships between X and Y. For instance, if you want to know whether smoking contributes to stress, you would need to make people smoke, which is ethically impossible. In that case, how do we establish causality using observational data? A good amount of research has been done on this particular issue, and the entire objective of the resulting methodologies is to eliminate the effect of any unobserved variable.

In this section, I will introduce you to some of these well-known techniques.

**Panel model:** An ordinary regression on panel data comes in very handy if the unobserved dimension is invariant along at least one dimension.

For instance, if the unobserved dimension is invariant over time, we can build a panel model that segregates out the bias coming from the unobserved dimension. Because the unobserved dimension does not change over time, we can eliminate it by differencing over time, and it then becomes possible to find the actual coefficient of the causal relationship between college and salary.
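The differencing step described above can be sketched as follows (the symbols here are illustrative, not from the original article): let \(y_{it}\) be salary, \(x_{it}\) the college/education variable, and \(a_i\) a time-invariant unobserved factor such as innate ability.

```latex
% Panel model with a time-invariant unobserved factor a_i
y_{i1} = \beta x_{i1} + a_i + \varepsilon_{i1}, \qquad
y_{i2} = \beta x_{i2} + a_i + \varepsilon_{i2}

% Differencing over time eliminates a_i:
\Delta y_i \;=\; y_{i2} - y_{i1} \;=\; \beta\,(x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1})
```

Regressing \(\Delta y_i\) on \(\Delta x_i\) then recovers \(\beta\) free of any bias coming from \(a_i\).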

**Look-alike matching:** The idea is to find units that look alike on their observed attributes and then compare the response to the treatment among these look-alikes. This is the most common method currently implemented in the industry.

The look-alikes can be found using a nearest-neighbour search, a k-d tree, or any other matching algorithm. In the smoking example, suppose we find two look-alike individuals; one of them starts smoking and the other does not.
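The matching step can be sketched with a brute-force nearest-neighbour search (the covariates and values here are hypothetical; on large data you would use a k-d tree or a library implementation instead):

```python
import numpy as np

def match_lookalikes(treated, control):
    """For each treated unit, return the index of the nearest control
    unit on the observed covariates (plain Euclidean distance)."""
    pairs = []
    for i, t in enumerate(treated):
        dist = np.linalg.norm(control - t, axis=1)
        pairs.append((i, int(np.argmin(dist))))
    return pairs

# Hypothetical covariates: [age, baseline stress score]
treated = np.array([[30.0, 5.0], [45.0, 7.0]])               # smokers
control = np.array([[29.0, 5.1], [46.0, 6.8], [60.0, 2.0]])  # non-smokers

print(match_lookalikes(treated, control))  # pairs of (treated, control) indices
```

Each treated unit is paired with its closest untreated counterpart, and the outcome difference within each pair estimates the treatment effect.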

The stress levels of the pair can then be compared over a period of time, given that no other conditions change between them. (A fuller treatment of matching is a topic for a future article.)

**Instrumental variables:** This is probably the hardest technique to implement. The steps are:

1. Find the cause-effect pair.
2. Find an attribute that is related to the cause but is independent of the error we get by regressing the cause-effect pair.

This attribute is known as an instrumental variable (IV). Next, estimate the cause variable using the IV.


Finally, regress the estimated cause against the effect to find the actual coefficient of causality. What have we done here? By using only the variation in the cause that comes from the instrument, we come out with an unbiased estimate.
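These steps can be sketched as a two-stage least squares on simulated data (the variable names and the data-generating process are invented for illustration; the true causal coefficient is set to 2.0):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated data: u is an unobserved confounder, z is the instrument
# (related to the cause x, independent of u).
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + u + rng.normal(size=n)              # the cause, contaminated by u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # the effect; true coefficient 2.0

# Stage 1 (step 2): regress the cause on the instrument, keep fitted values.
x_hat = z * (z @ x) / (z @ z)

# Stage 2 (step 3): regress the effect on the estimated cause.
beta_iv = (x_hat @ y) / (x_hat @ x_hat)

# Naive OLS is biased upward because x and u are correlated.
beta_ols = (x @ y) / (x @ x)

print(f"IV estimate: {beta_iv:.2f}, naive OLS: {beta_ols:.2f}")
```

The IV estimate lands near the true value of 2.0, while the naive regression is pulled toward 3.0 by the unobserved confounder.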


Returning to the smoking example: if we can find information that is connected to cigarette consumption but not to mental stress, we can use it to recover the actual relationship. Generally, IVs are regulation-based variables.

**Quasi-randomized selection:** This is one of my favourite techniques, because it makes observational data come really close to an experimental design. Suppose we want to test the effect of a college scholarship on students' grades by the end of the course.


Because students who win scholarships are already bright, they might have continued being on top in the future anyway, which makes this a very difficult cause-effect relation to crack! The trick is to compare students just above and just below the qualifying cutoff: the assumption is that these students are essentially similar, and the only thing that differs between them is the scholarship. This is known as quasi-randomized selection, and its results come very close to clean conclusions about causality. The only challenge with this methodology is that finding such a dimension, one that gives a pure break between the treated and non-treated populations, is very difficult.
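The scholarship example can be sketched on simulated data (the cutoff, effect size, and all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical setup: a scholarship is awarded to students whose entry
# score is at least 80.
score = rng.uniform(0, 100, size=n)
scholarship = score >= 80
# Final grade depends on ability (proxied by the entry score) AND on the
# scholarship; the true scholarship effect is 5 grade points.
grade = 0.5 * score + 5.0 * scholarship + rng.normal(0, 2, size=n)

# Naive comparison: heavily biased, because scholarship students were
# already bright to begin with.
naive = grade[scholarship].mean() - grade[~scholarship].mean()

# Quasi-randomized selection: compare only students just below and just
# above the cutoff, who are essentially similar except for the scholarship.
band = (score >= 79.5) & (score <= 80.5)
local = grade[band & scholarship].mean() - grade[band & ~scholarship].mean()

print(f"naive: {naive:.1f}, near-cutoff: {local:.1f}")
```

The naive difference wildly overstates the effect, while the near-cutoff comparison recovers something close to the true 5-point effect.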

### End Notes

Establishing causality is probably the most difficult task in the field of analytics, and the probability of getting it wrong is exceptionally high.

The key concepts discussed in this article will help you address the question of causality to a good extent. To end with some humour on the topic, here are a few images that drive home the difference between correlation and causality.

Were you able to find the right cause-effect pairs given at the beginning of this article?